# 📓 Lesson 3: Reading Data from Files
📘 What you will learn:
1. How to read data from different file types
2. How to use functions like read_csv(), read_excel(), read_json()
3. How to explore and understand the data using head(), tail(), info(), and describe()

## 📂 Step 1: Read a CSV File
One of the most common formats for tabular data is CSV (comma-separated values).
Let’s read a file called Sales_January_2019.csv.

! Make sure the file is in the data/ folder that the relative path ../data/ points to (one level above the folder containing this notebook).

In [4]:
import pandas as pd

# Read a CSV file from the data folder
csv_df = pd.read_csv('../data/Sales_January_2019.csv')

# Show the first 5 rows of the dataset
print(csv_df.head())

  Order ID                   Product Quantity Ordered Price Each  \
0   141234                    iPhone                1        700   
1   141235  Lightning Charging Cable                1      14.95   
2   141236          Wired Headphones                2      11.99   
3   141237          27in FHD Monitor                1     149.99   
4   141238          Wired Headphones                1      11.99   

       Order Date                       Purchase Address  
0  01/22/19 21:25        944 Walnut St, Boston, MA 02215  
1  01/28/19 14:15       185 Maple St, Portland, OR 97035  
2  01/17/19 13:33  538 Adams St, San Francisco, CA 94016  
3  01/05/19 20:33     738 10th St, Los Angeles, CA 90001  
4  01/25/19 11:59          387 10th St, Austin, TX 73301  


📌 What this does:

- Loads the file into a DataFrame
- Uses the first row as column names and infers a data type for each column
- Displays the top 5 rows to give you an idea of the content
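
💡 To see the header detection and type inference in isolation, here is a minimal sketch. It assumes a tiny in-memory sample (built with io.StringIO) that copies a few columns from the rows shown above; it is not the real file.

```python
import io
import pandas as pd

# Small in-memory stand-in for the sales file (an assumption for illustration)
sample_csv = io.StringIO(
    "Order ID,Product,Quantity Ordered,Price Each\n"
    "141234,iPhone,1,700\n"
    "141235,Lightning Charging Cable,1,14.95\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# The first line of the text became the column names
print(list(sample_df.columns))

# Each column received an inferred data type
print(sample_df.dtypes)
```

Checking dtypes right after loading is a quick way to spot columns that were read as text when you expected numbers.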

## 📊 Step 2: Reading Excel Files
If your data is in .xlsx format (Excel), you can read it using read_excel().

! You need to install the openpyxl package to work with .xlsx files:

In [None]:
# The % prefix runs pip inside the notebook's own environment
%pip install openpyxl




In [5]:
# Read Excel file from the same dataset
excel_df = pd.read_excel('../data/Sales_January_2019.xlsx')

# Show the first 5 rows
print(excel_df.head())

  Order ID                   Product Quantity Ordered Price Each  \
0   141234                    iPhone                1        700   
1   141235  Lightning Charging Cable                1      14.95   
2   141236          Wired Headphones                2      11.99   
3   141237          27in FHD Monitor                1     149.99   
4   141238          Wired Headphones                1      11.99   

       Order Date                       Purchase Address  
0  01/22/19 21:25        944 Walnut St, Boston, MA 02215  
1  01/28/19 14:15       185 Maple St, Portland, OR 97035  
2  01/17/19 13:33  538 Adams St, San Francisco, CA 94016  
3  01/05/19 20:33     738 10th St, Los Angeles, CA 90001  
4  01/25/19 11:59          387 10th St, Austin, TX 73301  


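📌 Note: Excel files can have multiple sheets. By default, read_excel() reads only the first one.

You can pass sheet_name (a name such as 'Sheet1', or a zero-based position such as 0) to select a specific sheet.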

In [None]:
# Read a specific sheet from Excel (if there is more than one)
df_sheet = pd.read_excel('../data/example_with_sheets.xlsx', sheet_name='January Sales')

# Show the first few rows
print(df_sheet.head())

📌 Tip: You can also use sheet_name=None to read all sheets as a dictionary:

In [None]:
# Read all sheets into a dictionary of DataFrames
all_sheets = pd.read_excel('../data/example_with_sheets.xlsx', sheet_name=None)

# Check available sheet names
print("Sheets:", all_sheets.keys())

# Access one of them
print(all_sheets['January Sales'].head())
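
💡 If you do not have a multi-sheet workbook at hand, the following sketch builds one in memory and reads it back. The sheet names and values are made up for illustration, and the example only runs its Excel part when openpyxl is installed.

```python
import io
import pandas as pd

try:
    import openpyxl  # noqa: F401  (pandas needs it for .xlsx files)
    has_openpyxl = True
except ImportError:
    has_openpyxl = False

sheet_names = []
if has_openpyxl:
    # Write two small sheets into an in-memory workbook (hypothetical data)
    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        pd.DataFrame({"Product": ["iPhone"], "Quantity Ordered": [1]}).to_excel(
            writer, sheet_name="January Sales", index=False
        )
        pd.DataFrame({"Product": ["Wired Headphones"], "Quantity Ordered": [2]}).to_excel(
            writer, sheet_name="February Sales", index=False
        )
    buffer.seek(0)

    # sheet_name=None returns a dictionary: sheet name -> DataFrame
    all_sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")
    for name, frame in all_sheets.items():
        print(name, frame.shape)
    sheet_names = list(all_sheets)
else:
    print("openpyxl is not installed, skipping the Excel example")
```

Looping over the dictionary is handy when every sheet has the same layout, for example one sheet per month.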

## 🌐 Step 3: Reading JSON Files
JSON (JavaScript Object Notation) is commonly used in APIs and web apps.

In [9]:
# Read a JSON Lines file (one JSON record per line)
json_df = pd.read_json('../data/Sales_January_2019.json', lines=True)

# Display the first 5 rows
print(json_df.head())


  Order ID                   Product Quantity Ordered Price Each  \
0   141234                    iPhone                1        700   
1   141235  Lightning Charging Cable                1      14.95   
2   141236          Wired Headphones                2      11.99   
3   141237          27in FHD Monitor                1     149.99   
4   141238          Wired Headphones                1      11.99   

       Order Date                       Purchase Address  
0  01/22/19 21:25        944 Walnut St, Boston, MA 02215  
1  01/28/19 14:15       185 Maple St, Portland, OR 97035  
2  01/17/19 13:33  538 Adams St, San Francisco, CA 94016  
3  01/05/19 20:33     738 10th St, Los Angeles, CA 90001  
4  01/25/19 11:59          387 10th St, Austin, TX 73301  


📌 The "lines=True" parameter tells Pandas that the file contains one JSON record per line. This is often called "JSON Lines" format.

💡 Note:
If your JSON file is structured as a complete list (starts with [ and ends with ]), you do not need to use "lines=True".

📄 Example of standard JSON (array format):

In [None]:
[
  {"Name": "Ali", "Age": 25},
  {"Name": "Sara", "Age": 30}
]

In [15]:
# Read a JSON file in array format (no lines=True needed)
df = pd.read_json('../data/json-array-format.json')

# Display the first 5 rows of the DataFrame we just loaded
print(df.head())

  Order ID                   Product Quantity Ordered Price Each  \
0   141234                    iPhone                1        700   
1   141235  Lightning Charging Cable                1      14.95   
2   141236          Wired Headphones                2      11.99   
3   141237          27in FHD Monitor                1     149.99   
4   141238          Wired Headphones                1      11.99   

       Order Date                       Purchase Address  
0  01/22/19 21:25        944 Walnut St, Boston, MA 02215  
1  01/28/19 14:15       185 Maple St, Portland, OR 97035  
2  01/17/19 13:33  538 Adams St, San Francisco, CA 94016  
3  01/05/19 20:33     738 10th St, Los Angeles, CA 90001  
4  01/25/19 11:59          387 10th St, Austin, TX 73301  


But for JSON Lines format (each record on a new line), use:

In [13]:
json_df = pd.read_json('../data/Sales_January_2019.json', lines=True)
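
💡 The two layouts can be compared side by side with a minimal sketch. It assumes in-memory text (via io.StringIO) that reuses the Name/Age records from the example above, rather than real files.

```python
import io
import pandas as pd

# Array format: one list that wraps all records
array_text = '[{"Name": "Ali", "Age": 25}, {"Name": "Sara", "Age": 30}]'

# JSON Lines format: one record per line, no surrounding brackets
lines_text = '{"Name": "Ali", "Age": 25}\n{"Name": "Sara", "Age": 30}\n'

df_array = pd.read_json(io.StringIO(array_text))
df_lines = pd.read_json(io.StringIO(lines_text), lines=True)

print(df_array)
print(df_lines)

# Both layouts describe the same table
print("Same data:", df_array.equals(df_lines))
```

If you read a JSON Lines file without lines=True, pandas typically raises a ValueError, which is a useful hint that the file has one record per line.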

## 🧪 Step 4: Exploring the Dataset
After loading the data, you should explore its structure and content:

In [16]:
# Show first 5 rows
print(csv_df.head())

# Show last 5 rows
print(csv_df.tail())

# Show number of rows and columns
print("Shape:", csv_df.shape)

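# Show column names, non-null counts, and data types
# info() prints its report itself and returns None, so it is not wrapped in print()
print("Info:")
csv_df.info()

# Get summary statistics (numeric columns by default)
print("Describe:")
print(csv_df.describe())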

  Order ID                   Product Quantity Ordered Price Each  \
0   141234                    iPhone                1        700   
1   141235  Lightning Charging Cable                1      14.95   
2   141236          Wired Headphones                2      11.99   
3   141237          27in FHD Monitor                1     149.99   
4   141238          Wired Headphones                1      11.99   

       Order Date                       Purchase Address  
0  01/22/19 21:25        944 Walnut St, Boston, MA 02215  
1  01/28/19 14:15       185 Maple St, Portland, OR 97035  
2  01/17/19 13:33  538 Adams St, San Francisco, CA 94016  
3  01/05/19 20:33     738 10th St, Los Angeles, CA 90001  
4  01/25/19 11:59          387 10th St, Austin, TX 73301  
     Order ID                 Product Quantity Ordered Price Each  \
9718   150497            20in Monitor                1     109.99   
9719   150498        27in FHD Monitor                1     149.99   
9720   150499         ThinkPad
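
💡 In the rows printed above, Price Each shows 700 next to 14.95, which may indicate that the column was loaded as text rather than as numbers. In that case describe() reports count, unique, top, and freq instead of mean and std. The sketch below assumes a made-up sample with one stray repeated header row (a hypothetical cause, not confirmed for this file) and shows one way to convert such a column with pd.to_numeric().

```python
import io
import pandas as pd

# Made-up sample: the third data row is a stray repeated header (assumption)
raw = io.StringIO(
    "Order ID,Product,Quantity Ordered,Price Each\n"
    "141234,iPhone,1,700\n"
    "141235,Lightning Charging Cable,1,14.95\n"
    "Order ID,Product,Quantity Ordered,Price Each\n"
    "141236,Wired Headphones,2,11.99\n"
)
sample_df = pd.read_csv(raw)

# The stray row forces every column to be read as text
print(sample_df.dtypes)

# With no numeric columns, describe() falls back to a text summary
text_summary = sample_df.describe()
print(text_summary.index.tolist())

# Convert to numbers; values that cannot be parsed become NaN
price = pd.to_numeric(sample_df["Price Each"], errors="coerce")

# Now describe() includes mean, std, min, max, and quartiles
numeric_summary = price.describe()
print(numeric_summary)
```

Using errors="coerce" keeps the conversion from failing on bad rows, but it also hides them as NaN, so it is worth counting the missing values afterwards with price.isna().sum().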

## 🧠 Practice Exercises
Use the file Sales_January_2019 in three formats (CSV, XLSX, JSON) and:
1. Load each format using the proper Pandas function
2. Print the first 10 rows
3. Print the number of rows and columns
4. Print the column names
5. Compare whether all formats load the same data correctly

In [18]:
# Load CSV
df_csv = pd.read_csv('../data/Sales_January_2019.csv')
print("CSV Sample:\n", df_csv.head(10))
print("CSV Shape:", df_csv.shape)
print("CSV Columns:", list(df_csv.columns))

# Load Excel
df_excel = pd.read_excel('../data/Sales_January_2019.xlsx')
print("Excel Sample:\n", df_excel.head(10))
print("Excel Shape:", df_excel.shape)
print("Excel Columns:", list(df_excel.columns))

# Load JSON
df_json = pd.read_json('../data/Sales_January_2019.json', lines=True)
print("JSON Sample:\n", df_json.head(10))
print("JSON Shape:", df_json.shape)
print("JSON Columns:", list(df_json.columns))


CSV Sample:
   Order ID                     Product Quantity Ordered Price Each  \
0   141234                      iPhone                1        700   
1   141235    Lightning Charging Cable                1      14.95   
2   141236            Wired Headphones                2      11.99   
3   141237            27in FHD Monitor                1     149.99   
4   141238            Wired Headphones                1      11.99   
5   141239      AAA Batteries (4-pack)                1       2.99   
6   141240      27in 4K Gaming Monitor                1     389.99   
7   141241        USB-C Charging Cable                1      11.95   
8   141242  Bose SoundSport Headphones                1      99.99   
9   141243    Apple Airpods Headphones                1        150   

       Order Date                         Purchase Address  
0  01/22/19 21:25          944 Walnut St, Boston, MA 02215  
1  01/28/19 14:15         185 Maple St, Portland, OR 97035  
2  01/17/19 13:33    538 Adams St
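
💡 For exercise step 5, one possible way to compare two loaded DataFrames is sketched below. It assumes a small made-up table that is written to CSV text and JSON Lines text in memory and then read back, so it runs without the real files; with the lesson data you would compare df_csv, df_excel, and df_json in the same way.

```python
import io
import pandas as pd

# Small made-up table standing in for the sales data (assumption)
original = pd.DataFrame({
    "Order ID": [141234, 141235],
    "Product": ["iPhone", "Lightning Charging Cable"],
    "Quantity Ordered": [1, 1],
    "Price Each": [700.0, 14.95],
})

# Save the same table in two text formats, then load each one back
csv_text = original.to_csv(index=False)
json_text = original.to_json(orient="records", lines=True)

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text), lines=True)

# Compare shape and column names first
same_shape = from_csv.shape == from_json.shape
same_columns = list(from_csv.columns) == list(from_json.columns)
print("Same shape:", same_shape)
print("Same columns:", same_columns)

# equals() is strict: values and data types must both match
print("Identical:", from_csv.equals(from_json))
```

If equals() reports a difference even though the values look the same, comparing from_csv.dtypes with from_json.dtypes usually shows which column was read with a different type.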

## 📌 Summary
In this lesson, you learned how to:
- Load different types of data files (CSV, Excel, JSON)
- Use Pandas functions like read_csv(), read_excel(), read_json()
- Explore the structure and contents of a dataset with head(), tail(), shape, info(), and describe()

👉 In the next lesson, you’ll learn how to filter data, select specific columns or rows, and work with conditions using loc, iloc, and query().