In [2]:
import pandas as pd

# Reading Data (CSV, Excel, JSON)  

## 1. Read CSV file - `pd.read_csv()`

In [3]:
df = pd.read_csv("student.csv")
df.head()

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


1. If the CSV file uses a different separator (such as `;`)  
   `df = pd.read_csv("student.csv", sep=";")`
2. If the first row is not a header  
   `df = pd.read_csv("student.csv", header=None)`
3. To set custom column names  
   `df = pd.read_csv("student.csv", names=["A", "B", "C", ...])`
4. To use a specific column as the index  
   `df = pd.read_csv("student.csv", index_col="col_name")`
5. To handle missing values (the listed strings are read as NaN; in this example `"?"`, `"-"`, and `"NA"`)  
   `df = pd.read_csv("student.csv", na_values=["?", "-", "NA"])`
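
The five options above can also be combined in a single call. The following is a minimal sketch, not taken from the notebook's files: it assumes a small semicolon-separated text with no header row, held in memory with `io.StringIO` so it runs without any file on disk. The column names and the `?` placeholder are illustrative.

```python
import io

import pandas as pd

# Hypothetical in-memory data: semicolon-separated, no header row,
# and "?" standing in for a missing mark.
raw = (
    "1;John Deo;Four;75;female\n"
    "2;Max Ruin;Three;?;male\n"
    "3;Arnold;Three;55;male\n"
)

options_df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                                          # 1. custom separator
    header=None,                                      # 2. first row is data, not a header
    names=["id", "name", "class", "mark", "gender"],  # 3. custom column names
    index_col="id",                                   # 4. use the "id" column as the index
    na_values=["?", "-", "NA"],                       # 5. strings to read as NaN
)
print(options_df)
```

Because one mark is missing, pandas stores the `mark` column as floats instead of integers.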

In [4]:
# 1


In [5]:
# 2 -> first row is not a header
pd.read_csv("data.csv", header=None).head()

Unnamed: 0,0,1,2,3,4
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


In [6]:
# 3 -> give custom column names
pd.read_csv("data.csv", header=None, names=["A","B","C","D","E"]).head()

Unnamed: 0,A,B,C,D,E
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


In [7]:
# 4 -> choose column as an index
pd.read_csv("data.csv", header=None, names=["A","B","C","D","E"], index_col="A").head()

A,B,C,D,E
1,John Deo,Four,75,female
2,Max Ruin,Three,85,male
3,Arnold,Three,55,male
4,Krish Star,Four,60,female
5,John Mike,Four,60,female


In [8]:
# 5 -> Handle null values
# For illustration only, treat the string "male" as the marker for missing data.

pd.read_csv("data.csv", header=None, na_values=["male"]).head()

Unnamed: 0,0,1,2,3,4
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,
2,3,Arnold,Three,55,
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


All the cells with "male" data are marked as NaN.
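
One way to confirm which cells were read as missing is to count the NaN values per column. This sketch rebuilds the first three rows shown above as an in-memory text (an assumption, since `data.csv` itself is not included here) and applies the same `na_values=["male"]` setting.

```python
import io

import pandas as pd

# First three rows of the sample data, held in memory instead of data.csv.
raw = (
    "1,John Deo,Four,75,female\n"
    "2,Max Ruin,Three,85,male\n"
    "3,Arnold,Three,55,male\n"
)

na_df = pd.read_csv(io.StringIO(raw), header=None, na_values=["male"])

# isna() marks every NaN cell; sum() then counts them column by column.
missing_per_column = na_df.isna().sum()
print(missing_per_column)
```

Only exact matches are replaced, so `"female"` is left untouched even though it contains `"male"`.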

## 2. Reading Excel files - `pd.read_excel()`

In [10]:
pd.read_excel("student.xlsx").head()

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


To read a particular sheet of the Excel file, pass the `sheet_name` parameter.

In [11]:
pd.read_excel("student.xlsx", sheet_name="student").head()

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female


## 3. Reading JSON files - `pd.read_json()`

In [13]:
pd.read_json("student.json").head()

Unnamed: 0,id,name,class,mark,gender
0,1,John Deo,Four,75,female
1,2,Max Ruin,Three,85,male
2,3,Arnold,Three,55,male
3,4,Krish Star,Four,60,female
4,5,John Mike,Four,60,female
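
The layout of `student.json` is not shown here, so the sketch below assumes a common one: a list of records, where each object becomes one row. The JSON text is wrapped in `io.StringIO` so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical JSON content in "records" layout: one object per row.
json_text = (
    '[{"id": 1, "name": "John Deo", "mark": 75},'
    ' {"id": 2, "name": "Max Ruin", "mark": 85}]'
)

json_df = pd.read_json(io.StringIO(json_text))
print(json_df)
```

The object keys become the column names, in the order they appear in the records.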


## Tip  
Always check the first few rows after loading the data.  
This confirms that the data loaded correctly and that the rows, columns, and column names are as expected.  
Use these:
- `.head()`
- `.info()`
- `.describe()`
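
A short sketch of these three checks, run on an in-memory copy of the first few sample rows (an assumption made so the example works without `student.csv`):

```python
import io

import pandas as pd

raw = (
    "id,name,class,mark,gender\n"
    "1,John Deo,Four,75,female\n"
    "2,Max Ruin,Three,85,male\n"
    "3,Arnold,Three,55,male\n"
)
check_df = pd.read_csv(io.StringIO(raw))

print(check_df.head())           # first rows: do the values sit under the right columns?
check_df.info()                  # column names, dtypes, and non-null counts
summary = check_df.describe()    # count, mean, std, min, max for numeric columns
print(summary)
```

By default `.describe()` covers only the numeric columns, which here are `id` and `mark`.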

# Practice

1. Create a small CSV file yourself (3–4 rows) and load it with `pd.read_csv()`.
2. Load the same CSV with a custom separator (e.g., `;`) by modifying the file.
3. Try `index_col=0` and see how the DataFrame changes.

## 1

In [20]:
# 1
practice_df = pd.read_csv("practice_file.csv")
practice_df.head(2)

Unnamed: 0,No; Name; Age; City
0,1; Aaa; 21; Satara
1,2; Bbb; 23; Pune


## 2

In [21]:
# 2
df2 = pd.read_csv("practice_file.csv", sep=";")
df2.head(2)

Unnamed: 0,No,Name,Age,City
0,1,Aaa,21,Satara
1,2,Bbb,23,Pune
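
The single-column output in exercise 1 suggests that the practice file has a space after each `;`. If so, `sep=";"` alone keeps that space at the start of the column names and string values. The sketch below assumes the file content shown in the string and compares a plain read with one that uses `skipinitialspace=True`.

```python
import io

import pandas as pd

# Assumed content of the practice file: a space follows each semicolon.
raw = (
    "No; Name; Age; City\n"
    "1; Aaa; 21; Satara\n"
    "2; Bbb; 23; Pune\n"
)

plain = pd.read_csv(io.StringIO(raw), sep=";")
trimmed = pd.read_csv(io.StringIO(raw), sep=";", skipinitialspace=True)

# Compare the column names: the plain read keeps the leading spaces.
print(list(plain.columns))
print(list(trimmed.columns))
```

With the trimmed version, `trimmed["Name"]` works directly; with the plain version the column would have to be addressed as `" Name"`.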


## 3

In [22]:
# 3
pd.read_csv("practice_file.csv", sep=";", index_col=0)

No,Name,Age,City
1,Aaa,21,Satara
2,Bbb,23,Pune
3,Ccc,34,Karad
4,Ddd,32,Mumbai
5,Eee,54,Diskal


`index_col=0` uses the first column (position 0) as the index.
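
To make the change visible, the following sketch reads the same text twice, once with the default index and once with `index_col=0`. The data is an in-memory stand-in for the practice file, written without spaces after the separators.

```python
import io

import pandas as pd

# In-memory stand-in for the practice file.
raw = (
    "No;Name;Age;City\n"
    "1;Aaa;21;Satara\n"
    "2;Bbb;23;Pune\n"
)

default_index = pd.read_csv(io.StringIO(raw), sep=";")
no_as_index = pd.read_csv(io.StringIO(raw), sep=";", index_col=0)

print(default_index.index.tolist())  # automatic row labels starting at 0
print(no_as_index.index.tolist())    # values taken from the "No" column
print(no_as_index.loc[2, "City"])    # rows can now be looked up by "No"
```

Once `No` is the index, it no longer appears among the regular columns.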