
##Preparation
Upload the file mystudents.csv.

#1.Data Cleansing or Cleaning
**Ref:**
* [Pandas API](https://pandas.pydata.org/docs/reference/index.html)
* [DataFrame](https://pandas.pydata.org/docs/reference/frame.html)
* [Medium: Data cleaning](https://medium.com/@info_46914/4-%E0%B8%82%E0%B8%B1%E0%B9%89%E0%B8%99%E0%B8%95%E0%B8%AD%E0%B8%99%E0%B8%81%E0%B8%B2%E0%B8%A3-clean-data-%E0%B8%AA%E0%B8%B3%E0%B8%84%E0%B8%B1%E0%B8%8D%E0%B9%84%E0%B8%89%E0%B8%99-why-data-quality-is-a-king-eb924a8e7d7e)
* [Tutorialspoint: Data Cleansing](https://www.tutorialspoint.com/python_data_science/python_data_cleansing.htm)
* [TowardsDataScience: Data cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)



#1.1 Data Missing
Handling missing values, such as NA or NaN (Not a Number), using Pandas

In [None]:
import pandas as pd
import numpy as np

student_dict = {'Name': ['Joe', 'Sam', 'Harry'], 'Age': [20, 21, 19], 'Mark': [85.10, np.nan, 91.54]}

# Create DataFrame from dict
df = pd.DataFrame(student_dict)
df

Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,
2,Harry,19,91.54


##1.Check for Missing Values
To make detecting missing values easier (and across different array dtypes), Pandas provides the **isnull()** and **notnull()** functions, which are also methods on Series and DataFrame objects.

In [None]:
df.notnull()

Unnamed: 0,Name,Age,Mark
0,True,True,True
1,True,True,False
2,True,True,True


In [None]:
df.isnull()

Unnamed: 0,Name,Age,Mark
0,False,False,False
1,False,False,True
2,False,False,False


In [None]:
df['Mark'].isnull()

0    False
1     True
2    False
Name: Mark, dtype: bool

In [None]:
df['Mark'].isnull().any()

True

In [None]:
df['Mark'].notnull().all()

False

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3 non-null      object 
 1   Age     3 non-null      int64  
 2   Mark    2 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes


In [None]:
df.isna().sum()

Name    0
Age     0
Mark    1
dtype: int64

In [None]:
# Checking the missing values
df.isnull().sum()

Name    0
Age     0
Mark    1
dtype: int64

In [None]:
# See the data type of each column
df.dtypes

Name     object
Age       int64
Mark    float64
dtype: object

##2.Fill Missing Values
The fillna function can “fill in” NA values with non-null data in a couple of ways.

**DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)**

Fill NA/NaN values using the specified method.

**Ref:**
[Pandas fillna doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
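
Besides a fixed value, a gap can be filled from its neighbours. The sketch below uses the `ffill()` and `bfill()` methods on a small, made-up Series (the values are illustrative, not from the notebook's data). Recent pandas releases deprecate the `method` argument of `fillna` in favour of these two methods, so depending on your version the signature above may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical marks recorded in order; two readings are missing
marks = pd.Series([85.10, np.nan, 91.54, np.nan], name='Mark')

# Forward fill: copy the last valid value forward
forward_filled = marks.ffill()

# Backward fill: copy the next valid value backward; the trailing NaN
# stays missing because no value follows it
backward_filled = marks.bfill()

print(forward_filled.tolist())
print(backward_filled.tolist())
```

Forward and backward filling mostly make sense when the row order carries meaning, for example in time-ordered data.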

###Ex1

In [None]:
student_dict = {'Name': ['Joe', 'Sam', 'Harry'], 'Age': [20, 21, 19], 'Mark': [85.10, np.nan, 91.54]}

df = pd.DataFrame(student_dict)
df

Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,
2,Harry,19,91.54


In [None]:
print("Replace NaN with '0':")
df1 = df.fillna(0)
df1

Replace NaN with '0':


Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,0.0
2,Harry,19,91.54


###Ex2

**Impute**
Imputation means estimating a missing value from other observations. There are many ways to do this.

**First**, use a summary statistic such as the mean or median. Neither guarantees unbiased data, especially when many values are missing.

The mean is most useful when the original data is not skewed. The median is more robust because it is not sensitive to outliers, so it is preferred when the data is skewed.

In normally distributed data, about 95% of the values lie within 2 standard deviations of the mean. Missing values can therefore be filled with random numbers drawn between (mean - 2 * std) and (mean + 2 * std).

**Second**, use linear regression. Based on the existing data, one can calculate the best-fit line between two variables, say, house price vs. size in m², and use it to predict the missing values.
Note that linear regression models are sensitive to outliers.
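
The imputation strategies described above can be sketched as follows. The house data, the column names `Size` and `Price`, and the fixed random seed are all assumptions made for illustration; `np.polyfit` stands in for a linear regression and is only one of several ways to fit the line.

```python
import numpy as np
import pandas as pd

# Hypothetical data: house size in m² and price, with one price missing
houses = pd.DataFrame({
    'Size': [50.0, 60.0, 70.0, 80.0, 90.0],
    'Price': [100.0, 120.0, np.nan, 160.0, 180.0],
})

# 1a. Statistical imputation with the median (robust to outliers)
median_filled = houses['Price'].fillna(houses['Price'].median())

# 1b. Random imputation inside mean +/- 2 * std of the observed values
rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
mean, std = houses['Price'].mean(), houses['Price'].std()
missing = houses['Price'].isna()
random_filled = houses['Price'].copy()
random_filled[missing] = rng.uniform(mean - 2 * std, mean + 2 * std, size=missing.sum())

# 2. Regression imputation: fit Price = slope * Size + intercept on complete rows
known = houses.dropna()
slope, intercept = np.polyfit(known['Size'], known['Price'], deg=1)
regression_filled = houses['Price'].fillna(slope * houses['Size'] + intercept)

print(median_filled.tolist())
print(random_filled.tolist())
print(regression_filled.round(2).tolist())
```

Which strategy is appropriate depends on how the values went missing and on how the column is distributed, so it is worth comparing the filled column against the original with `describe()`.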

In [None]:
df.describe()

Unnamed: 0,Age,Mark
count,3.0,2.0
mean,20.0,88.32
std,1.0,4.553768
min,19.0,85.1
25%,19.5,86.71
50%,20.0,88.32
75%,20.5,89.93
max,21.0,91.54


In [None]:
# numeric_only=True skips the non-numeric Name column
print(df.mean(numeric_only=True))
print("\nReplace NaN with means:")
df1 = df.fillna(df.mean(numeric_only=True))
df1

Age     20.00
Mark    88.32
dtype: float64

Replace NaN with means:


Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,88.32
2,Harry,19,91.54


In [None]:
df1 = df.fillna(df.mean(numeric_only=True))
print('df1 is')
print(df1)

print('\ndf is')
print(df)

df1 is
    Name  Age   Mark
0    Joe   20  85.10
1    Sam   21  88.32
2  Harry   19  91.54

df is
    Name  Age   Mark
0    Joe   20  85.10
1    Sam   21    NaN
2  Harry   19  91.54


In [None]:
print('\nBefore: df is')
print(df)
df.fillna(df.mean(numeric_only=True), inplace=True)
print('\nAfter: df is')
print(df)


Before: df is
    Name  Age   Mark
0    Joe   20  85.10
1    Sam   21    NaN
2  Harry   19  91.54

After: df is
    Name  Age   Mark
0    Joe   20  85.10
1    Sam   21  88.32
2  Harry   19  91.54


###Ex3

In [None]:
student_dict = {'Name': ['Joe', 'Sam', 'Harry'], 'Age': [20, 21, 19], 'Mark': [85.10, np.nan, 91.54]}

df = pd.DataFrame(student_dict)
df

Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,
2,Harry,19,91.54


In [None]:
# Keys that are not column names (here 'Total Mark') are ignored
change_dict = {'Name': 'John Doe', 'Mark': -100, 'Total Mark': 0}
df1 = df.fillna(value=change_dict)
df1

Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,-100.0
2,Harry,19,91.54


###Ex4

In [None]:
#Using reindexing, we have created a DataFrame with missing values. 
df1 = df.reindex([0, 1, 2, 3])
df1

Unnamed: 0,Name,Age,Mark
0,Joe,20.0,85.1
1,Sam,21.0,
2,Harry,19.0,91.54
3,,,


In [None]:
change_dict = {'Name': 'John Doe', 'Mark': -100, 'Total Mark': 0}

df1 = df1.fillna(value=change_dict)
df1

Unnamed: 0,Name,Age,Mark
0,Joe,20.0,85.1
1,Sam,21.0,-100.0
2,Harry,19.0,91.54
3,John Doe,,-100.0


##3.Replace Missing (or) Generic Values
Replace a generic value with some specific value, using the **replace** method.

###Ex1

In [None]:
temp_dict = {'one':[1,2,3,4,5,2000], 'two':[1000,np.nan,3,4,5,6]}
df = pd.DataFrame(temp_dict)
df

Unnamed: 0,one,two
0,1,1000.0
1,2,
2,3,3.0
3,4,4.0
4,5,5.0
5,2000,6.0


In [None]:
df1 = df.replace({1000:-10})
df1

Unnamed: 0,one,two
0,1,-10.0
1,2,
2,3,3.0
3,4,4.0
4,5,5.0
5,2000,6.0


In [None]:
df2 = df.replace({1000:-10, 2000:-20})
df2

Unnamed: 0,one,two
0,1,-10.0
1,2,
2,3,3.0
3,4,4.0
4,5,5.0
5,-20,6.0


###Ex2

In [None]:
df.one.mean()

335.8333333333333

In [None]:
df3 = df.replace({1000:df.one.mean()})
df3

Unnamed: 0,one,two
0,1,335.833333
1,2,
2,3,3.0
3,4,4.0
4,5,5.0
5,2000,6.0


In [None]:
df4 = df.replace({np.nan:df.one.mean()})
df4

Unnamed: 0,one,two
0,1,1000.0
1,2,335.833333
2,3,3.0
3,4,4.0
4,5,5.0
5,2000,6.0


##4.Drop Missing Values

If you simply want to exclude missing values, use the **dropna** function with the **axis** argument. By default, **axis=0** (row-wise), which means that if any value within a row is NA, the whole row is excluded.

**DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)**

Remove missing values.

axis = 0, or ‘index’ : Drop rows which contain missing values.

axis = 1, or ‘columns’ : Drop columns which contain missing values.


**Ref:** [Pandas dropna doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)




In [None]:
import pandas as pd
import numpy as np

student_dict = {'Name': ['Joe', 'Sam', 'Harry'], 'Age': [20, 21, 19], 'Mark': [85.10, np.nan, 91.54]}

# Create DataFrame from dict
df = pd.DataFrame(student_dict)
df

Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
1,Sam,21,
2,Harry,19,91.54



###Ex1



In [None]:
# axis=0 or 'index' (default): drop rows which contain missing values.
# axis=1 or 'columns': drop columns which contain missing values.
df_drop = df.dropna()
df_drop

Unnamed: 0,Name,Age,Mark
0,Joe,20,85.1
2,Harry,19,91.54


###Ex2

In [None]:
df_drop =  df.dropna(axis='columns')  
df_drop

Unnamed: 0,Name,Age
0,Joe,20
1,Sam,21
2,Harry,19
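
The signature above also lists `how`, `thresh` and `subset`, which the two examples do not exercise. A minimal sketch, using a made-up frame (`df_scores` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({
    'Name': ['Joe', 'Sam', 'Harry', None],
    'Age': [20, 21, np.nan, np.nan],
    'Mark': [85.10, np.nan, 91.54, np.nan],
})

# how='all': drop a row only when every value in it is missing
drop_empty_rows = df_scores.dropna(how='all')

# thresh=3: keep only rows that have at least three non-missing values
keep_complete = df_scores.dropna(thresh=3)

# subset=['Mark']: only look at the Mark column when deciding
drop_missing_mark = df_scores.dropna(subset=['Mark'])

print(drop_empty_rows.index.tolist())
print(keep_complete.index.tolist())
print(drop_missing_mark.index.tolist())
```

These options let you keep partially filled rows instead of discarding every row that has a single gap.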


#1.2 Duplicate Data

##Check duplicates

In [None]:
import pandas as pd

student_dict = {'Name': ['Joe', 'Nat', 'Harry', 'Joe', 'Nat'], 
                        'Age': [20, 21, 19, 20, 21],
                        'Mark': [70, 77.80, 91.54, 85.10, 77.80]}

student_df = pd.DataFrame(student_dict)
student_df

Unnamed: 0,Name,Age,Mark
0,Joe,20,70.0
1,Nat,21,77.8
2,Harry,19,91.54
3,Joe,20,85.1
4,Nat,21,77.8


In [None]:
print('\nDuplicates in column Name')
student_df.duplicated('Name') # find duplicates in column Name


Duplicates in column Name


0    False
1    False
2    False
3     True
4     True
dtype: bool
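
As a complement, the sketch below checks for duplicates across whole rows and uses `keep=False` to flag every member of a duplicate group. The frame repeats the data above under the illustrative name `students`.

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Joe', 'Nat', 'Harry', 'Joe', 'Nat'],
    'Age': [20, 21, 19, 20, 21],
    'Mark': [70, 77.80, 91.54, 85.10, 77.80],
})

# No argument: a row is a duplicate only if every column matches an earlier row
full_row_duplicates = students.duplicated()

# keep=False marks every member of a duplicate group, not just the repeats
all_name_duplicates = students.duplicated('Name', keep=False)

print(full_row_duplicates.sum())        # number of fully duplicated rows
print(students[all_name_duplicates])    # rows whose Name appears more than once
```

Inspecting the flagged rows before dropping anything helps to tell true duplicates from records that merely share a name.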

##Ex1 Drop duplicates but keep first
When a DataFrame has duplicate rows that we want to remove, we use DataFrame.drop_duplicates().

**Rows that contain the same values in all columns are identified as duplicates.**

If a row is duplicated, DataFrame.drop_duplicates() by default keeps the first occurrence of that row and drops all other duplicates of it.

In [None]:
import pandas as pd

student_dict = {'Name': ['Joe', 'Nat', 'Harry', 'Joe', 'Nat'], 
                        'Age': [20, 21, 19, 20, 21],
                        'Mark': [70, 77.80, 91.54, 85.10, 77.80]}
                
# Create DataFrame from dict
print('Before')
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nAfter')
# drop duplicate rows
student_df = student_df.drop_duplicates()

print(student_df)


Before
    Name  Age   Mark
0    Joe   20  70.00
1    Nat   21  77.80
2  Harry   19  91.54
3    Joe   20  85.10
4    Nat   21  77.80

After
    Name  Age   Mark
0    Joe   20  70.00
1    Nat   21  77.80
2  Harry   19  91.54
3    Joe   20  85.10


##Ex2 Drop duplicates from defined columns
By default, DataFrame.drop_duplicates() removes rows with the same values in all the columns. However, we can modify this behavior using the subset parameter.

For example, subset=[col1, col2] will remove the duplicate rows with the same values in specified columns only, i.e., col1 and col2.

In [None]:
import pandas as pd

student_dict = {'Name': ['Joe', 'Nat', 'Harry', 'Joe', 'Nat'], 'Age': [20, 21, 19, 20, 21], 'Mark': [70, 77.80, 91.54, 85.10, 77.80]}

# Create DataFrame from dict
print('Before')
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nAfter')
# drop duplicate rows
student_df = student_df.drop_duplicates(subset=['Age'])

print(student_df)

Before
    Name  Age   Mark
0    Joe   20  70.00
1    Nat   21  77.80
2  Harry   19  91.54
3    Joe   20  85.10
4    Nat   21  77.80

After
    Name  Age   Mark
0    Joe   20  70.00
1    Nat   21  77.80
2  Harry   19  91.54


##Ex3 Drop duplicates but keep last
To choose which occurrence of a duplicate row is kept, we can use the keep parameter of DataFrame.drop_duplicates(), which takes the following inputs:

*   first – Drop duplicates except for the first occurrence of the duplicate row. This is the default behavior.
*   last – Drop duplicates except for the last occurrence of the duplicate row.
*   False – Drop all the rows which are duplicate.


In [None]:
import pandas as pd

student_dict = {'Name': ['Joe', 'Nat', 'Harry', 'Joe', 'Nat'], 'Age': [20, 21, 19, 20, 21], 'Mark': [70, 77.80, 91.54, 85.10, 77.80]}

print('Before')
# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nAfter')
# drop duplicate rows
student_df = student_df.drop_duplicates(keep='last')

print(student_df)

Before
    Name  Age   Mark
0    Joe   20  70.00
1    Nat   21  77.80
2  Harry   19  91.54
3    Joe   20  85.10
4    Nat   21  77.80

After
    Name  Age   Mark
0    Joe   20  70.00
2  Harry   19  91.54
3    Joe   20  85.10
4    Nat   21  77.80


##Ex4 Drop all duplicates
If we need to drop all the duplicate rows, then it can be done by using keep=False, as shown below.

In [None]:
import pandas as pd

student_dict = {'Name': ['Joe', 'Nat', 'Harry', 'Joe', 'Nat'], 'Age': [20, 21, 19, 20, 21], 'Mark': [70, 77.80, 91.54, 85.10, 77.80]}

print('Before')
# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nAfter')
student_df = student_df.drop_duplicates(keep=False)

print(student_df)


Before
    Name  Age   Mark
0    Joe   20  70.00
1    Nat   21  77.80
2  Harry   19  91.54
3    Joe   20  85.10
4    Nat   21  77.80

After
    Name  Age   Mark
0    Joe   20  70.00
2  Harry   19  91.54
3    Joe   20  85.10


#1.3 Irregular Data (Outliers)


In [None]:
import pandas as pd
df = pd.read_csv("mystudents.csv")
df


In [None]:
df = df.dropna()
df

##Ex1 Use intuition

In [None]:
# Rows with a negative age cannot be valid
temp = df[df['Age'] < 0]
print(temp)
print(temp.shape)

In [None]:
# Keep only rows with a positive age
df = df[df['Age'] > 0]
df

##Ex2 Use Boxplot for Data Visualization

In [None]:
df

In [None]:
df.hist()

In [None]:
df.boxplot(column=['Student ID'])

In [None]:
df.boxplot(column=['Age'])

In [None]:
df['Age'].value_counts()

In [None]:
df['Age'].value_counts().plot.bar()

In [None]:
df.describe()

In [None]:
df['Age'].describe()
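
A boxplot flags points that fall far outside the interquartile range (IQR). The same idea can be applied numerically with the common 1.5 * IQR rule, sketched below on made-up ages; the Series `ages` and its values are assumptions, not the contents of mystudents.csv.

```python
import pandas as pd

# Hypothetical ages; 150 is an obvious data-entry error
ages = pd.Series([19, 20, 20, 21, 22, 23, 150], name='Age')

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (ages < lower_fence) | (ages > upper_fence)
print(ages[is_outlier].tolist())

# Keep only the values inside the fences
cleaned_ages = ages[~is_outlier]
print(cleaned_ages.tolist())
```

The factor 1.5 is a convention rather than a law, so a flagged value should still be checked against domain knowledge before it is removed.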

#1.4 Unnecessary or Outdated Data


In [None]:
import pandas as pd
df = pd.read_csv("mystudents.csv")
df

##Prob1 
Which rows/columns/section may be irrelevant or unnecessary?
Can you remove such data?

Hint: check out the drop() method

##Prob2 
Which rows/columns/section may be outdated?
Can you remove such data?
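
As a starting point for both problems, here is a minimal sketch of `drop()` on a made-up frame. The column names `Fax` and `Year` and the cut-off 2020 are hypothetical and are not taken from mystudents.csv.

```python
import pandas as pd

# Hypothetical frame; the column names are illustrative only
records = pd.DataFrame({
    'Name': ['Joe', 'Nat', 'Harry'],
    'Fax': ['n/a', 'n/a', 'n/a'],
    'Year': [2015, 2023, 2024],
})

# Drop a column that carries no useful information
without_fax = records.drop(columns=['Fax'])

# Drop rows by index label, here the rows considered outdated
outdated = records[records['Year'] < 2020].index
current_only = records.drop(index=outdated)

print(without_fax.columns.tolist())
print(current_only)
```

`drop()` returns a new DataFrame, so the original stays unchanged unless you reassign the result.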

#2.Data Integration

##Ex 1 Merge two dataframes

In [None]:
import pandas as pd
df1 = pd.read_csv('https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/mark.csv',header = 0)
df2 = pd.read_csv('https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/student.csv',header = 0)

In [None]:
df1

In [None]:
df2

In [None]:
# Integrate the two DataFrames on the 'Student_id' column using pd.merge()
df = pd.merge(df1, df2, on = 'Student_id')
df
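
`pd.merge()` performs an inner join by default, so ids that appear in only one frame are silently dropped. The sketch below uses two small made-up frames (stand-ins for the CSV files, with assumed columns) to show the `how` and `indicator` parameters.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
marks = pd.DataFrame({'Student_id': [1, 2, 3], 'Mark': [95, 70, 98]})
students = pd.DataFrame({'Student_id': [2, 3, 4], 'Name': ['Nat', 'Harry', 'Ethan']})

# Inner join (the default): only ids present in both frames
inner = pd.merge(marks, students, on='Student_id')

# Outer join: keep every id; indicator=True adds a _merge column
# showing which frame each row came from
outer = pd.merge(marks, students, on='Student_id', how='outer', indicator=True)

print(inner)
print(outer)
```

Comparing the row counts of the inner and outer results is a quick way to find records that failed to match.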

##Ex2 Join two DataFrames
The DataFrame.join() function joins one DataFrame with another, as in df1.join(df2). By default the rows are matched on the index.

In [None]:
import pandas as pd

# create dataframe from dict 
print('student_df')
student_dict = {'Name': ['Joe', 'Nat'], 'Age': [20, 21]}
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nmark_df')
# create dataframe from dict 
mark_dict = {'Mark': [85.10, 77.80]}
mark_df = pd.DataFrame(mark_dict)
print(mark_df)

print('\njoined_df')
# join dfs
joined_df = student_df.join(mark_df)
print(joined_df)

##Ex3 Concatenation

In [None]:
import pandas as pd

student_dict1 = {'Name': ['Joe', 'Nat'], 'Age': [20, 21], 'Mark': [85.10, 77.80]}
student_dict2 = {'Name': [ 'Harry', 'Ethan'], 'Age': [19, 21], 'Mark': [91.54, 71.80]}
print('student_df1')
# Create DataFrame from dict
student_df1 = pd.DataFrame(student_dict1)
print(student_df1)

print('\nstudent_df2')
# Create DataFrame from dict
student_df2 = pd.DataFrame(student_dict2)
print(student_df2)

print('\ncombined')
combined_df = pd.concat([student_df1, student_df2], axis=0)  # (axis = 0 for row, axis = 1 for column)
print(combined_df)
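
Row-wise concatenation keeps each frame's original index labels, so the combined frame ends up with repeated labels. A minimal sketch of the `ignore_index` parameter, using shortened copies of the frames above (the names `first` and `second` are illustrative):

```python
import pandas as pd

first = pd.DataFrame({'Name': ['Joe', 'Nat'], 'Mark': [85.10, 77.80]})
second = pd.DataFrame({'Name': ['Harry', 'Ethan'], 'Mark': [91.54, 71.80]})

# Default: original labels are kept, so 0 and 1 each appear twice
stacked = pd.concat([first, second], axis=0)

# ignore_index=True: build a fresh 0..n-1 index for the result
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

A fresh index avoids surprises later, since label-based selection such as `stacked.loc[0]` would return two rows.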

##Ex 4: Aggregate or group dataframes based on condition
GroupBy means splitting the data into groups based on some condition, applying a function to each group, and combining the results. Large data can be divided into logical groups to analyze it.

DataFrame.groupby() function groups the DataFrame row-wise or column-wise based on the condition.

Example: To analyze each class’s average marks, we need to combine the student data based on the ‘Class’ column and calculate its average using df.groupby(col_label).mean()

In [None]:
import pandas as pd

# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Class': ['A', 'B', 'A'], 'Mark': [85.10, 77.80, 91.54]}
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nGroupby mean')
# apply group by; numeric_only=True skips the non-numeric Name column
student_df = student_df.groupby('Class').mean(numeric_only=True)
print(student_df)

In [None]:
import pandas as pd

# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Class': ['A', 'B', 'A'], 'Mark': [85.10, 77.80, 91.54]}
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nSelect only Mark column')
student_stat = student_df['Mark']
print(student_stat)

print('\nAggregate')
student_agg = student_stat.agg(['sum', 'max','mean']) #list of function names
print(student_agg)
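
Grouping and aggregating can also be combined in one step. The sketch below uses named aggregation on the same student data; the output column names `students`, `mean_mark` and `max_mark` are chosen here for illustration.

```python
import pandas as pd

student_df = pd.DataFrame({
    'Name': ['Joe', 'Nat', 'Harry'],
    'Class': ['A', 'B', 'A'],
    'Mark': [85.10, 77.80, 91.54],
})

# Named aggregation: each keyword becomes an output column,
# defined as (input column, aggregation function)
class_summary = student_df.groupby('Class').agg(
    students=('Name', 'count'),
    mean_mark=('Mark', 'mean'),
    max_mark=('Mark', 'max'),
)
print(class_summary)
```

Naming the input column for each statistic avoids accidentally aggregating text columns such as `Name` with a numeric function.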

#3.Data Transformation
**Ref:** [Label Encoding](https://www.mygreatlearning.com/blog/label-encoding-in-python/)

##Ex1 Transpose

In [None]:
import pandas as pd

# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Class': ['A', 'B', 'A'], 'Mark': [85.10, 77.80, 91.54]}
student_df = pd.DataFrame(student_dict)
print(student_df)

print('\nTranspose View')
print(student_df.T)


##Ex2 Replace Categorical Values

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/student.csv")
df.head()

In [None]:
df.dtypes

In [None]:
#Use select_dtypes() to find the categorical (non-numeric) columns and put them in a separate DataFrame, df_categorical
df_categorical = df.select_dtypes(exclude=np.number)
df_categorical.head()

In [None]:
#find unique values in the column
df_categorical['Gender'].unique()

In [None]:
df_categorical = df_categorical.replace({'Male': 1, 'Female':2})
df_categorical
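
Calling `replace` on a whole DataFrame rewrites matching values in every column. When only one column should be encoded, `Series.map` is a possible alternative, sketched below on a made-up frame; the value `'Other'` and the column name `Gender_code` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column; the real student.csv may contain other values
people = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Other']})

gender_codes = {'Male': 1, 'Female': 2}

# map() only touches this column; values missing from the dict become NaN,
# which makes unexpected categories easy to spot
people['Gender_code'] = people['Gender'].map(gender_codes)

print(people)
print(people['Gender_code'].isna().sum())
```

Unlike `replace`, which leaves unmatched values untouched, `map` turns them into NaN, so an incomplete mapping shows up immediately.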

##Ex3 Label Encoder


In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/student.csv")
df.head()

In [None]:
#import the LabelEncoder class
from sklearn.preprocessing import LabelEncoder
#Creating the object instance
label_encoder = LabelEncoder()

In [None]:
df['Gender'] = label_encoder.fit_transform(df['Gender'])
df.head()
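
To see which integer stands for which label, and to convert codes back, the fitted encoder exposes `classes_` and `inverse_transform`. A minimal sketch on a made-up Series (the values are assumed, not read from student.csv):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

genders = pd.Series(['Male', 'Female', 'Female', 'Male'])

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ lists the original labels; a label's code is its position here
print(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(codes, restored)
```

The codes follow the sorted order of the labels, so they carry no meaning beyond identifying the category.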

##Ex4 One Hot Encoder

Alternative: 
use OneHotEncoder class from sklearn.preprocessing

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/student.csv")
df.head()

In [None]:
df['Grade'].unique()

In [None]:
#get_dummies creates a one-hot encoding for each unique categorical
#value in the column named col_name. Here, the prefix is added at the beginning of each categorical value 
#to create new column names for the one-hot columns
#df_onehot = pd.get_dummies(df, columns=['Grade'], prefix=['one_hot'])

df_onehot = pd.get_dummies(df, columns=['Grade'])
df_onehot.head()

In [None]:
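#import the OneHotEncoder class
from sklearn.preprocessing import OneHotEncoder
#Creating the object instance
#scikit-learn 1.2 renamed the sparse argument to sparse_output; use sparse=False on older versions
onehot_encoder = OneHotEncoder(sparse_output=False)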

##Ex5 Scaling

In [None]:
import pandas as pd

# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Class': ['A', 'B', 'A'], 'Mark': [65.10, 55.80, 71.54]}
student_df = pd.DataFrame(student_dict)
print(student_df)

In [None]:
student_df = pd.DataFrame(student_dict)
student_df['Mark'] = student_df['Mark'] + 10
print(student_df)

In [None]:
student_df = pd.DataFrame(student_dict)
student_df['Mark'] = 0
print(student_df)

In [None]:
student_df = pd.DataFrame(student_dict)
temp_df = student_df[ student_df['Mark'] > 60 ]
print(temp_df)

##Ex6 Min-Max Scaling
Perform normalization (min-max scaling) by using the MinMaxScaler class from sklearn.preprocessing and its fit_transform() method.

Transform columns by scaling each column to a given range individually,  e.g. between zero and one.

Ref: [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Wholesale%20customers%20data.csv")
df.head()

In [None]:
null_df = df.isna().any()
print(null_df)

In [None]:
from sklearn.preprocessing import MinMaxScaler
#Creating the object instance
min_max_scaler = MinMaxScaler()

#norm_scale = MinMaxScaler().fit_transform(df) #alternative

norm_scale = min_max_scaler.fit_transform(df) 
scaled_frame = pd.DataFrame(norm_scale,columns=df.columns)
scaled_frame.head()
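
With the default range of zero to one, min-max scaling maps each value x in a column to (x - min) / (max - min). The sketch below computes this by hand with pandas on made-up spending figures (the frame `spending` and its numbers are illustrative), which can help when checking the scaler's output.

```python
import pandas as pd

# Hypothetical spending figures
spending = pd.DataFrame({'Fresh': [100.0, 300.0, 500.0], 'Milk': [10.0, 20.0, 50.0]})

# Min-max scaling: (x - min) / (max - min), computed column by column
scaled = (spending - spending.min()) / (spending.max() - spending.min())
print(scaled)
```

Each column is scaled independently, so the smallest value in every column becomes 0 and the largest becomes 1.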

##Ex7 Discretization
Data discretization is the process of converting continuous data into discrete buckets by grouping it.
The pandas.cut() function is useful for binning values into such buckets.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Student_bucketing.csv',header = 0)
df.head(10)

In [None]:
# bins=5 splits the range of marks into five equal-width intervals
df['mark_class'] = pd.cut(df['marks'], bins=5, labels=['Poor', 'Below_average', 'Average', 'Above_Average', 'Excellent'])
df.head(10)

In [None]:
df['mark_class'].unique()

In [None]:
df.mark_class.value_counts()
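
Passing an integer to `pd.cut` creates equal-width bins from the observed range, so the boundaries shift with the data. When fixed thresholds are wanted, explicit bin edges can be supplied instead. A minimal sketch with made-up marks, edges and labels:

```python
import pandas as pd

marks = pd.Series([35, 48, 62, 75, 91], name='marks')

# Explicit bin edges: (0, 40], (40, 60], (60, 80], (80, 100]
bins = [0, 40, 60, 80, 100]
labels = ['Poor', 'Average', 'Good', 'Excellent']
mark_class = pd.cut(marks, bins=bins, labels=labels)

print(mark_class.tolist())
```

By default each interval includes its right edge, so a mark of exactly 40 would fall in the first bin.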