# Problem Statement
Perform the following operations using Python on any open-source dataset (e.g., data.csv):
1. Import all the required Python libraries.
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.

In addition to the code and outputs, explain every operation that you perform in the above steps, including everything you do to import, read, or scrape the data set.

In [76]:
import pandas as pd

In [77]:
df = pd.read_csv("StudentsPerformance.csv")
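
`read_csv` can also be told which placeholder strings to treat as missing while the file is being read. The sketch below illustrates the idea on a small in-memory CSV. The rows are hypothetical, not taken from the real file, and it assumes `?` is the only placeholder used for a missing score.

```python
import io
import pandas as pd

# Hypothetical sample standing in for StudentsPerformance.csv.
sample_csv = io.StringIO(
    "gender,math score,reading score\n"
    "female,72,72.0\n"
    "male,?,57.0\n"
    "female,,95.0\n"
)

# na_values adds '?' to the strings that read_csv treats as missing,
# so 'math score' is parsed as a numeric column instead of object.
sample_df = pd.read_csv(sample_csv, na_values=["?"])

print(sample_df.dtypes)
print(sample_df.isnull().sum())
```

With the placeholder declared at load time, both the `?` entry and the empty field are counted as missing values.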

In [78]:
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72.0,74.0
1,female,group C,some college,standard,completed,69,90.0,88.0
2,female,group B,master's degree,standard,none,90,95.0,93.0
3,male,group A,associate's degree,free/reduced,none,47,57.0,44.0
4,male,group C,some college,standard,none,76,78.0,75.0
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99.0,95.0
996,male,group C,high school,free/reduced,none,62,55.0,55.0
997,female,group C,high school,free/reduced,completed,59,71.0,65.0
998,female,group D,some college,standard,completed,68,78.0,77.0


In [79]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72.0,74.0
1,female,group C,some college,standard,completed,69,90.0,88.0
2,female,group B,master's degree,standard,none,90,95.0,93.0
3,male,group A,associate's degree,free/reduced,none,47,57.0,44.0
4,male,group C,some college,standard,none,76,78.0,75.0


In [80]:
df.tail()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
995,female,group E,master's degree,standard,completed,88,99.0,95.0
996,male,group C,high school,free/reduced,none,62,55.0,55.0
997,female,group C,high school,free/reduced,completed,59,71.0,65.0
998,female,group D,some college,standard,completed,68,78.0,77.0
999,female,group D,some college,free/reduced,none,77,86.0,86.0


In [81]:
df.sample(5)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
682,male,group B,high school,standard,none,62,55.0,54.0
442,female,group A,some high school,free/reduced,none,?,73.0,69.0
28,male,group C,high school,standard,none,70,4.0,
382,male,group C,master's degree,free/reduced,none,79,81.0,71.0
804,female,group C,some college,standard,none,73,76.0,78.0


In [82]:
df.shape

(1000, 8)

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   gender                       1000 non-null   object 
 1   race/ethnicity               1000 non-null   object 
 2   parental level of education  1000 non-null   object 
 3   lunch                        1000 non-null   object 
 4   test preparation course      1000 non-null   object 
 5   math score                   992 non-null    object 
 6   reading score                994 non-null    float64
 7   writing score                991 non-null    float64
dtypes: float64(2), object(6)
memory usage: 62.6+ KB


In [84]:
df.describe()

Unnamed: 0,reading score,writing score
count,994.0,991.0
mean,68.008048,69.487386
std,16.60227,29.563757
min,3.0,10.0
25%,58.0,57.0
50%,69.5,69.0
75%,79.0,79.0
max,100.0,567.0
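
By default `describe()` summarizes only numeric columns, which is why `math score` (stored as object) does not appear in the output above. A minimal sketch of how the non-numeric columns can be summarized as well, using hypothetical rows shaped like the dataset:

```python
import pandas as pd

# Hypothetical rows in the same shape as the dataset.
sample_df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": ["72", "?", "90", "47"],  # strings, so not numeric
    "reading score": [72.0, 90.0, 95.0, 57.0],
})

# Default describe() covers numeric columns only.
numeric_summary = sample_df.describe()

# include='all' adds count, unique, top and freq for non-numeric columns.
full_summary = sample_df.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary)
```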


In [85]:
df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     8
reading score                  6
writing score                  9
dtype: int64
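
The raw counts can be complemented by the share of missing values per column and by the affected rows themselves. The following is a small sketch on hypothetical data, where `missing_pct` and `rows_with_missing` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with a few gaps.
sample_df = pd.DataFrame({
    "math score": [72.0, np.nan, 90.0, 47.0],
    "reading score": [72.0, 90.0, np.nan, 57.0],
    "writing score": [74.0, 88.0, np.nan, 44.0],
})

# isnull() gives a boolean frame; its column mean is the missing fraction.
missing_pct = sample_df.isnull().mean() * 100

# Rows that contain at least one missing value.
rows_with_missing = sample_df[sample_df.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_missing)
```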

In [86]:
df.dtypes

gender                          object
race/ethnicity                  object
parental level of education     object
lunch                           object
test preparation course         object
math score                      object
reading score                  float64
writing score                  float64
dtype: object
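
In pandas terms, the "character" variables of the task are the `object` columns, and the closest equivalent of a "factor" is the `category` dtype. A minimal sketch of that conversion on hypothetical rows, where the column list is chosen by hand for illustration:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "test preparation course": ["none", "completed", "none"],
    "reading score": [72.0, 90.0, 95.0],
})

# Columns assumed to hold a small, fixed set of labels.
categorical_columns = ["gender", "test preparation course"]

# astype with a dict converts only the listed columns.
sample_df = sample_df.astype({col: "category" for col in categorical_columns})

print(sample_df.dtypes)
print(sample_df["gender"].cat.categories.tolist())
```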

In [87]:
df['math score'].unique()

array(['72', '69', '90', '47', '76', '71', '88', '40', '64', '38', '58',
       '65', '78', '50', nan, '74', '73', '67', '70', '62', '63', '56',
       '81', '75', '57', '55', '53', '59', '66', '82', '77', '33', '52',
       '0', '79', '39', '45', '60', '61', '41', '49', '44', '30', '80',
       '42', '27', '43', '68', '85', '98', '87', '54', '51', '99', '84',
       '91', '83', '89', '22', '100', '96', '94', '46', '97', '48', '35',
       '34', '86', '92', '37', '28', '24', '113', '123', '?', '-89', '26',
       '334', '95', '36', '29', '32', '93', '19', '23', '8'], dtype=object)
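
Reading through `unique()` by eye works for a short list, but the non-numeric entries can also be isolated programmatically. A sketch, assuming the column holds numeric strings plus occasional placeholders (the sample values are hypothetical):

```python
import pandas as pd

raw = pd.Series(["72", "69", None, "?", "-89", "334"], name="math score")

# errors='coerce' turns anything that is not a number into NaN.
parsed = pd.to_numeric(raw, errors="coerce")

# Present in the data but not parseable as a number.
unparseable = raw[raw.notna() & parsed.isna()]

print(unparseable.value_counts())
```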

In [88]:
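# Replace '?' and negative values with NaN.
# 'math score' is still stored as strings, so compare on a numeric copy
# instead of relying on string ordering (df['math score'] < '0').
math_numeric = pd.to_numeric(df['math score'], errors='coerce')
invalid = (df['math score'] == '?') | (math_numeric < 0)
df['math score'] = df['math score'].mask(invalid)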

In [89]:
df['math score'].unique()

array(['72', '69', '90', '47', '76', '71', '88', '40', '64', '38', '58',
       '65', '78', '50', nan, '74', '73', '67', '70', '62', '63', '56',
       '81', '75', '57', '55', '53', '59', '66', '82', '77', '33', '52',
       '0', '79', '39', '45', '60', '61', '41', '49', '44', '30', '80',
       '42', '27', '43', '68', '85', '98', '87', '54', '51', '99', '84',
       '91', '83', '89', '22', '100', '96', '94', '46', '97', '48', '35',
       '34', '86', '92', '37', '28', '24', '113', '123', '26', '334',
       '95', '36', '29', '32', '93', '19', '23', '8'], dtype=object)
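
Note that values such as '113', '123' and '334' are still present after this step. If the scores are assumed to lie on a 0 to 100 scale (an assumption, since the source does not state the range), parsing and range checking can be combined in one pass, as in this sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values mirroring the kinds of entries seen in 'math score'.
raw = pd.Series(["72", "69", "?", "-89", "113", "334"], name="math score")

# Non-numeric strings such as '?' become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Assumption: valid scores lie between 0 and 100 inclusive.
# where() keeps values that satisfy the condition and sets the rest to NaN.
cleaned = numeric.where(numeric.between(0, 100))

print(cleaned.tolist())
```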

In [90]:
# Filling missing values
# Imputation using mean/median/mode
# ('math score' still holds strings, so it must be converted to numbers before taking its mean)
# df['math score'] = df['math score'].fillna(df['math score'].mean())
# df['reading score'] = df['reading score'].fillna(df['reading score'].mean())
# df['writing score'] = df['writing score'].fillna(df['writing score'].mean())

# Using forward fill (fillna(method='ffill') is deprecated in recent pandas versions)
df['math score'] = df['math score'].ffill()
df['reading score'] = df['reading score'].ffill()
df['writing score'] = df['writing score'].ffill()

# Using backward fill
# df['math score'] = df['math score'].bfill()
# df['reading score'] = df['reading score'].bfill()
# df['writing score'] = df['writing score'].bfill()
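
The imputation strategies listed in this cell give different results on the same gaps. The sketch below compares forward fill with median imputation on a hypothetical series. Median is used here as an illustration because it is less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

scores = pd.Series([72.0, np.nan, 90.0, 47.0, np.nan, 76.0], name="reading score")

# Forward fill copies the previous row's value, so it depends on row order.
forward_filled = scores.ffill()

# Median imputation uses one statistic of the observed values for every gap.
median_filled = scores.fillna(scores.median())

print(forward_filled.tolist())
print(median_filled.tolist())
```

Forward fill also leaves a gap in the very first row unfilled, because there is no earlier value to copy.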

In [91]:
# Convert the score columns to int64 ('math score' holds numeric strings; the other two are float64)
df['math score'] = df['math score'].astype('int64')
df['reading score'] = df['reading score'].astype('int64')
df['writing score'] = df['writing score'].astype('int64')

df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

In [92]:
df['math score'].head()

0    72
1    69
2    90
3    47
4    76
Name: math score, dtype: int64
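
`astype('int64')` only succeeds once every missing value has been filled. Where gaps should be kept rather than imputed, one option is the nullable `Int64` dtype. This is a sketch on hypothetical values, not the approach used in this notebook:

```python
import pandas as pd

raw = pd.Series(["72", "69", None, "90"], name="math score")

# Parse the strings; the missing entry becomes NaN (float64).
numeric = pd.to_numeric(raw, errors="coerce")

# The nullable integer dtype stores the gap as <NA> instead of raising.
nullable_int = numeric.astype("Int64")

print(nullable_int)
```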

In [95]:
# Data normalization (min-max): (x - min) / (max - min)
maxx = df['math score'].max()
mini = df['math score'].min()
x = df['math score']
df['math score'] = (x - mini) / (maxx - mini)

df['math score'].head()

0    0.215569
1    0.206587
2    0.269461
3    0.140719
4    0.227545
Name: math score, dtype: float64
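
The scaled values above are consistent with a column minimum of 0 and a maximum of 334 (72 / 334 ≈ 0.2156), so the out-of-range entry compresses every score at or below 100 into the lower third of the [0, 1] range. The sketch below wraps the same formula in a hypothetical helper, `min_max_scale`, and compares hypothetical data with and without such an outlier.

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series to [0, 1] using (x - min) / (max - min)."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - lo) / (hi - lo)


with_outlier = pd.Series([0, 47, 72, 90, 334], name="math score")
without_outlier = pd.Series([0, 47, 72, 90, 100], name="math score")

scaled_with = min_max_scale(with_outlier)
scaled_without = min_max_scale(without_outlier)

print(scaled_with.round(4).tolist())
print(scaled_without.round(4).tolist())
```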

In [98]:
df['gender'].unique()

array(['female', 'male'], dtype=object)

In [99]:
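# Label encoding: map each gender label to an integer code.
# Assigning the result back avoids calling replace(..., inplace=True) on a column selection.
df['gender'] = df['gender'].map({'female': 0, 'male': 1})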

In [100]:
df['gender'].unique()

array([0, 1], dtype=int64)

In [103]:
df['gender'].describe()

count    1000.000000
mean        0.482000
std         0.499926
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: gender, dtype: float64
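
The remaining categorical columns can be encoded in a similar way. For a column with more than two unordered labels, one-hot encoding avoids implying an order. For a quick integer code, the `category` dtype can be used. A minimal sketch on hypothetical rows:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "race/ethnicity": ["group B", "group C", "group A", "group C"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
})

# One-hot encoding: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(sample_df, columns=["race/ethnicity"], dtype=int)

# Integer codes via the category dtype (codes follow sorted category order).
sample_df["lunch_code"] = sample_df["lunch"].astype("category").cat.codes

print(one_hot)
print(sample_df[["lunch", "lunch_code"]])
```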