# **Pandas - Python Library**

### **What is Pandas?**

- **Pandas is a Python library for data analysis**
- **It provides functions for analyzing, cleaning, exploring, and manipulating data**

### **Why Use Pandas?**

- **It helps analyze large datasets**
- **It helps draw conclusions based on statistical theory**
- **Pandas can clean messy datasets**
- **Pandas methods can delete rows, clean data, rename columns, and access or filter specific data**

### **Installing Pandas**

## **- Pandas Series**

- **A pandas Series is like a column in a table**
- **It is a one-dimensional array that can hold data of any type**

## **- Pandas Dataframes**

- **Datasets in pandas are usually two-dimensional tables, called DataFrames.**
- **A Series is like a column; a DataFrame is the whole table.**
- **A DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns**
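
The relationship between the two structures can be sketched as follows. The names and marks are made-up sample values, not rows from the student dataset used later.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
marks = pd.Series([72, 85, 64], name="Marks")

students = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena"],
    "Marks": [72, 85, 64],
})

# Selecting a single column of a DataFrame gives back a Series
marks_column = students["Marks"]

print(type(marks_column).__name__)
print(marks.ndim, students.ndim)   # number of dimensions of each structure
print(students.shape)              # (rows, columns)
```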

### **Import Pandas**

In [None]:
# import the pandas library so its functions are available in this notebook
import pandas as pd     # pd is the conventional alias for pandas

### **Load Files into DataFrame**

#### **Pandas - Read CSV file**

In [None]:
# load the dataset with the pandas read_csv() function
# df is the variable that holds the data read from student_dataset.csv
df = pd.read_csv("C:\\Users\\HP\\Documents\\student_dataset.csv")
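
The path above only exists on the author's machine. As a minimal sketch for trying `read_csv()` without that file, the snippet below reads CSV text from memory. The column names follow the ones this notebook reports for the dataset, but the three rows are invented for illustration.

```python
import io
import pandas as pd

# Made-up rows that reuse the column names reported for the student dataset
csv_text = """StudentID,Name,Age,Gender,Course,Marks,Grade,City
1,Asha,21,Female,Data Science,88,A,Hyderabad
2,Ravi,19,Male,Python,47,F,Chennai
3,Meena,24,Female,Data Science,76,B,Hyderabad
"""

# read_csv() accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)
print(sample_df.dtypes)
```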

## **Functions -**

### **1. shape** - Returns number of rows and columns

In [None]:
# how many rows and columns are there
df.shape

- **This dataset has 150 rows and 8 columns**

### **2. head()** - Used to view the first rows

In [None]:
# display the first 3 rows
df.head(3)

In [None]:
# display first 5 rows
df.head()    # head() displays 5 rows by default

### **3. columns** - Used to view column names

In [None]:
# display the column names
df.columns

### **4. dtypes** - Used to view datatype of each column

In [None]:
# display datatypes of all columns
df.dtypes

1. **In this dataset StudentID, Age, Marks are of integer datatype**
2. **And Name, Gender, Course, Grade, City are of object datatype**

### **5. isnull()** - For checking missing values

In [None]:
df.isnull()   # returns a boolean DataFrame: True where a value is missing (NaN), False otherwise

In [None]:
# count the missing values in each column
df.isnull().sum()    # isnull() flags missing values and sum() counts them per column
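
Because the student dataset has no missing values, the effect of `isnull().sum()` is easier to see on a small frame that does. This is a sketch with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", None],
    "Marks": [88.0, np.nan, 76.0],
})

missing_mask = sample.isnull()               # True where a value is missing
missing_per_column = sample.isnull().sum()   # True counts as 1 when summed
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```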

### **6. duplicated()** - Used to check duplicates

In [None]:
# check whether each row is a duplicate of an earlier row
df.duplicated()    # returns True for rows that repeat an earlier row, False otherwise

In [None]:
# count the duplicate records in this dataset
df.duplicated().sum()   

In [None]:
df['Name'].duplicated().sum()     # duplicates in Name column
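
The difference between checking whole rows and checking a single column can be sketched on a small, invented frame. `drop_duplicates()` is not used elsewhere in this notebook; it appears here only to show the usual follow-up step.

```python
import pandas as pd

# Hypothetical rows: the third row repeats the first one exactly
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Asha", "Asha"],
    "City": ["Hyderabad", "Chennai", "Hyderabad", "Pune"],
})

full_row_dupes = sample.duplicated()        # compares every column
name_dupes = sample["Name"].duplicated()    # compares the Name column only
deduplicated = sample.drop_duplicates()     # keeps the first copy of each row

print(full_row_dupes.sum())   # 1
print(name_dupes.sum())       # 2
print(len(deduplicated))      # 3
```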

### **7. describe()** - Used to get statistical summary of dataframe

In [None]:
# get the basic statistical summary of numerical columns
df.describe()

In [None]:
# get the basic statistical summary of data of numerical and categorical columns
df.describe(include = "all")

- **This dataset has 150 entries with no missing or duplicate values**
- **The average age of the students is 24**
- **The minimum age of the students is 18**
- **The maximum age of the students is 30**
- **The average marks of the students is 67**
- **The minimum marks of the students is 40**
- **The maximum marks of the students is 100**

### **8. unique()** - Used to get unique values from a specific column 

In [None]:
# display the unique values in the Name column
df['Name'].unique()

In [None]:
# display the unique values in gender column
df['Gender'].unique()

In [None]:
# display the unique values in course column
df['Course'].unique()

In [None]:
# display the unique values in city column
df['City'].unique()

### **9. value_counts()** - Counts the number of occurrences of each unique value in a column

In [None]:
# count the number of students in each course 
df['Course'].value_counts()

In [None]:
# count the number of students of each gender
df['Gender'].value_counts()

In [None]:
# find the number of students from each city
df['City'].value_counts()

In [None]:
# find the count of each grade
df['Grade'].value_counts()
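
As a small sketch on invented values, `value_counts()` sorts its result from most to least frequent, and `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical course values, for illustration only
courses = pd.Series(
    ["Data Science", "Python", "Data Science", "Data Science"],
    name="Course",
)

counts = courses.value_counts()                 # most frequent value first
shares = courses.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```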

### **10. nunique()** - Counts the number of unique values in a column

In [None]:
# find the number of unique cities
df['City'].nunique()

In [None]:
# find the number of unique Gender
df['Gender'].nunique()

In [None]:
# find the number of unique Grades
df['Grade'].nunique()

In [None]:
# find the number of unique courses
df['Course'].nunique()

In [None]:
# find the number of unique marks
df['Marks'].nunique()

### **11. info()** -  Used to get summary of dataframe

In [None]:
df.info()

### **12. max() / min()** - Returns maximum and minimum values

In [None]:
# what is maximum and minimum marks obtained by the students
print("Max marks are :", df['Marks'].max())
print("Min marks are :", df['Marks'].min())

### **13. sum()** - Used to calculate sum of elements in a dataframe

In [None]:
# how many students got grade "A" (each True counts as 1, so sum() counts the matches)
(df['Grade'] == "A").sum()

In [None]:
# how many students failed (grade "F")
(df['Grade'] == "F").sum()

In [None]:
# how many students are female
(df['Gender'] == 'Female').sum()

### **14. len()** - Returns the number of rows in the DataFrame

In [None]:
len(df)

### **15. index** - Returns the row index (an attribute, so it is used without parentheses)

In [None]:
df.index

### **16. rename()** - Renames the specified columns or rows

In [None]:
df.rename(columns = {'Score':'Marks'}, inplace = True)  # rename Score to Marks (no effect if there is no Score column)

In [None]:
df.rename(columns = {"StudentID":"ID"}, inplace = True)   # rename StudentID to ID

In [None]:
df.head()    # display first five rows

### **17. sort_values()** - Sorts rows by the values in one or more columns

In [None]:
# sort the records by age in descending order
df.sort_values(by = 'Age', ascending = False).head()

In [None]:
# sort the records by marks in descending order
df.sort_values(by = 'Marks', ascending = False).head()

In [None]:
# sort the records by city in ascending (alphabetical) order
df.sort_values(by = "City", ascending = True).head()

In [None]:
# axis = 0 for rows
# axis = 1 for columns
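
The `axis` note above is easiest to see on a tiny frame. This sketch uses invented subject scores and is not part of the student dataset. With `sum()`, `axis = 0` adds down the rows and gives one total per column, while `axis = 1` adds across the columns and gives one total per row. With `drop()`, `axis = 1` means the label names a column.

```python
import pandas as pd

# Hypothetical scores, for illustration only
scores = pd.DataFrame({"Maths": [70, 80], "Science": [60, 90]})

column_totals = scores.sum(axis=0)   # one total per column
row_totals = scores.sum(axis=1)      # one total per row
without_science = scores.drop("Science", axis=1)   # the label refers to a column

print(column_totals)
print(row_totals)
print(without_science.columns.tolist())
```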

### **18. Filtering** - Selecting the rows that meet a condition

In [None]:
# show the details of students from the city Hyderabad
df[df['City'] == 'Hyderabad'].head()

In [None]:
# show the details of female students
df[df['Gender'] == "Female"].head()

In [None]:
# show the details of students who scored more than 80 (first five matches)
df[df['Marks'] > 80].head()

In [None]:
# get the records of students younger than 20
df[df['Age'] < 20].head()

In [None]:
# get the students whose course is Data Science
df[df['Course'] == 'Data Science'].head()
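
The filters above each test one condition. As a sketch on invented rows, conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "Kiran"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Marks": [88, 47, 76, 91],
    "City": ["Hyderabad", "Chennai", "Hyderabad", "Pune"],
})

# both conditions must hold
female_high = sample[(sample["Gender"] == "Female") & (sample["Marks"] > 80)]

# at least one condition must hold
hyderabad_or_high = sample[(sample["City"] == "Hyderabad") | (sample["Marks"] > 80)]

print(female_high["Name"].tolist())
print(hyderabad_or_high["Name"].tolist())
```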

### **19. loc[]** - Used to access rows and columns by label (label-based indexing)

In [None]:
# get the name and course for row labels 0 to 4 (loc includes the end label, so this returns 5 records)
df.loc[:4,['Name','Course']]

In [None]:
# select three columns by name for all records, then show the first five
df.loc[:,['Name','Course','Gender']].head()

### **20. iloc[]** - Used to access rows and columns by position (position-based indexing)

In [None]:
# get the third record
df.iloc[2]   # positions start at 0, so the third record is at position 2

In [None]:
# access the first six records with the first five columns (iloc excludes the end position)
df.iloc[0:6, 0:5]
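
The two indexers differ in how they slice and in what they look up. The sketch below uses invented rows. `loc` includes the end label of a slice and `iloc` excludes the end position. After sorting, labels and positions no longer line up, so the two indexers return different rows.

```python
import pandas as pd

# Hypothetical rows with the default 0..3 index
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "Kiran"],
    "Marks": [88, 47, 76, 91],
})

by_label = sample.loc[:2, ["Name"]]    # labels 0, 1 and 2: three rows
by_position = sample.iloc[:2, [0]]     # positions 0 and 1: two rows

# sorting keeps the original labels but changes the positions
ranked = sample.sort_values(by="Marks", ascending=False)
first_by_position = ranked.iloc[0]["Name"]   # first row after sorting
first_by_label = ranked.loc[0]["Name"]       # row whose label is 0

print(len(by_label), len(by_position))
print(first_by_position, first_by_label)
```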

### **21. Add column**

In [None]:
df["Grace_score"] = df["Marks"] + 10    # add new column to dataset

In [None]:
df.head()

### **22. Drop Column**

In [None]:
df.drop("Grace_score", axis = 1, inplace = True)

In [None]:
df.head()

### **23. rank()** - Ranks the values in a column, assigning each value a numerical position

In [None]:
df["Marks"].rank(ascending = False).head()   # index is not changed

### **24. astype()** - Converts the datatype of a column

In [None]:
df["Marks"].astype(float).head()

### **25. apply()** - Used to apply a function to each value of a column, or to each row or column of a DataFrame

In [None]:
df["Result"] = df["Marks"].apply(lambda x : "Pass" if x >= 50 else "Fail")

In [None]:
df['Top'] = df['Marks'].apply(lambda x : 'TOPPER' if x == 100 else '-') 

In [None]:
df.head()
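
For a simple two-way rule like Pass or Fail, `numpy.where` gives the same labels without calling a Python function for every value. This is an alternative that the notebook itself does not use, shown here as a sketch on invented marks.

```python
import numpy as np
import pandas as pd

# Hypothetical marks, including the boundary value 50
sample = pd.DataFrame({"Marks": [88, 47, 50, 100]})

# element-by-element version, as in the cells above
sample["Result"] = sample["Marks"].apply(lambda x: "Pass" if x >= 50 else "Fail")

# vectorized version: the condition is evaluated on the whole column at once
sample["Result_vectorized"] = np.where(sample["Marks"] >= 50, "Pass", "Fail")

print(sample)
```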

### **26. query()** - Uses SQL-like expressions to filter data

In [None]:
df.query("Gender == 'Female' and Marks > 75").head()

### **27. pivot_table()** - Summarizes data with an aggregate function, using one column for the rows and another for the columns

In [None]:
df.pivot_table(values = "Marks", index = "Course", columns = "Gender", aggfunc = "mean")    

In [None]:
df.pivot_table(values = "Marks", index = "Course", columns = "City", aggfunc = "max")

### **28. tail()** - Returns the last rows (five by default)

In [None]:
df.tail()

### **29. groupby()** - Used to group data based on one or more columns and then apply an aggregate function

In [None]:
# what are the average marks for each course
df.groupby('Course')['Marks'].mean()

In [None]:
df.groupby('Gender')['Marks'].mean()
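
Several aggregates can be computed in one pass by giving `agg()` a list of function names. A sketch on invented rows:

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "Course": ["Data Science", "Data Science", "Data Science", "Python"],
    "Marks": [80, 70, 100, 60],
})

# one row per course, one column per aggregate
summary = sample.groupby("Course")["Marks"].agg(["mean", "max", "count"])

print(summary)
```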

### **30. nlargest()** - Returns the n rows with the highest values

In [None]:
df.nlargest(5,"Marks")  

### **31. nsmallest()** - Returns the n rows with the smallest values

In [None]:
df.nsmallest(5,"Age")