<a href="https://colab.research.google.com/github/Mohith29/Machine-Learning-/blob/master/Data_Munging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



## Learning Objective

At the end of this experiment, you will be able to:

* perform data preprocessing

In [None]:
#@title Mini Hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="320" height="240" controls>
  <source src="https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/aiml_batch_15/preview_videos/Mini_Hackathon_Data_Munging_Briefing.mp4" type="video/mp4">
</video>
""")

## Problem Statement

We will use district-wise demographics, enrollment, and teacher indicator data to predict whether the literacy rate in each district is high, medium, or low.

### Data Preprocessing

Data preprocessing is an important step in solving every machine learning problem. Most of
the datasets used in machine learning problems need to be processed, cleaned, or transformed
before a machine learning algorithm can be trained on them.

There are different steps involved in Data Preprocessing. These steps are as follows:

    1. Data Cleaning → In this step the primary focus is on
        - Handling missing data
        - Handling noisy data
        - Detection and removal of outliers
    
    2. Data Integration → This step applies when data is gathered from several sources. The sources
    are combined into one consistent dataset, which is then cleaned and used
    for analysis.
    
    3. Data Transformation → In this step we convert the raw data into the format required by the model we are building. Common
    transformations include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After data transformation and scaling, redundant data is removed and the remaining data is organized efficiently.
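
The four steps above can be sketched on a small example. This is only an illustration under stated assumptions: the `basic` and `schools` tables, their column names, and the choice of median imputation and min-max normalization are hypothetical and are not taken from the hackathon data.

```python
import pandas as pd

# Hypothetical toy tables that share a "district" key.
basic = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "population": [1200.0, None, 900.0, 50000.0],  # one missing value
})
schools = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "schools": [12, 30, 9, 11],
    "schools_dup": [12, 30, 9, 11],  # redundant copy of "schools"
})

# 1. Data cleaning: fill the missing population with the column median.
basic["population"] = basic["population"].fillna(basic["population"].median())

# 2. Data integration: combine both sources on the shared key.
merged = basic.merge(schools, on="district")

# 3. Data transformation: min-max normalization of the numeric columns.
num_cols = ["population", "schools", "schools_dup"]
transformed = merged.copy()
transformed[num_cols] = (merged[num_cols] - merged[num_cols].min()) / (
    merged[num_cols].max() - merged[num_cols].min()
)

# 4. Data reduction: drop the column that duplicates another one.
reduced = transformed.drop(columns=["schools_dup"])
print(reduced)
```

Each step works on the output of the previous one, which mirrors the order in which the exercises below process the district data.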



### Total Marks  = 20

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook = "U1_MH1_Data_Munging"  # name of the notebook

def setup():
    # Download and unpack the dataset archive using the IPython `sx` shell magic
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/B15_Data_Munging.zip")
    ipython.run_line_magic("sx", "unzip B15_Data_Munging.zip")
    print("Data downloaded successfully")

setup()

Data downloaded successfully


In [None]:
!ls

## Exercise 1 - Load and Explore the Data (2 Marks)
1. We have three different files

  * Districtwise_Basicdata.csv
  * Districtwise_Enrollment_details_indicator.csv
  * Districtwise_Teacher_indicator.csv

  These files contain the necessary data to solve the problem. <br>

2. Load the files based on the **team allocation** mentioned below. Inspect the header rows and the data records while loading the data.
  
  Hint : Use read_csv from pandas with [skiprows or header](https://towardsdatascience.com/import-csv-files-as-pandas-dataframe-with-skiprows-skipfooter-usecols-index-col-and-header-fbf67a2f92a) options.

3. Read the columns of the dataset and rename if required.

  Hint : Rename column names (if any) using the following [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).
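
The two hints above can be tried without the real files. The sketch below reads a hypothetical in-memory CSV (the title line and the two data rows are made up for illustration) whose real header sits on the second line, then renames two columns.

```python
import io

import pandas as pd

# Hypothetical file contents: one title line above the real header row.
raw = io.StringIO(
    "District report 2012-13\n"
    "Statecd,distcd,overall_lit\n"
    "35,3501,High\n"
    "28,2801,Low\n"
)

# header=1 takes the column names from the second line (0-based index 1)
# and discards the line above it.
df = pd.read_csv(raw, header=1)

# rename returns a new DataFrame; columns not listed are left unchanged.
df = df.rename(columns={"Statecd": "State Code", "distcd": "District Code"})
print(df.columns.tolist())
```

Opening a file in a text editor, or printing its first few lines, is usually the quickest way to find out which row holds the column names.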

Team allocation for dataset selection

    Team A = 1,3,5,7,9,11,13,15 
        Districtwise_Basicdata.csv
        Districtwise_Enrollment_details_indicator.csv

    Team B = 2,4,6,8,10,12,14,16
        Districtwise_Basicdata.csv
        Districtwise_Teacher_indicator.csv

In [None]:
# Import the required packages; add any other necessary imports here
import pandas as pd
import numpy as np

In [None]:
# YOUR CODE HERE for loading and exploring the datasets

# header=N reads the column names from row N (0-based) and ignores the rows above it
data1 = pd.read_csv('Districtwise_Basicdata.csv', header=1)
data2 = pd.read_csv('Districtwise_Teacher_indicator.csv', header=3)

In [None]:
print(data1.head())
print(data2.head())

      Year  Statecd  ... overall_lit  female_lit
0  2012-13       35  ...        High       84.52
1  2012-13       35  ...        High       79.39
2  2012-13       35  ...        High       70.70
3  2012-13       28  ...         Low       51.99
4  2012-13       28  ...         Low       54.31

[5 rows x 19 columns]
   statecd  ... tch_nontch
0       35  ...        519
1       35  ...        362
2       35  ...         28
3       28  ...        263
4       28  ...       1185

[5 rows x 181 columns]


In [None]:
# Rename the key columns so that both dataframes use the same names
data1 = data1.rename(columns={"Statecd": "State Code", "distcd": "District Code"})

In [None]:
# Rename the key columns so that both dataframes use the same names
data2 = data2.rename(columns={"ac_year": "Year", "statecd": "State Code", "distcd": "District Code"})

data2.head()

Unnamed: 0,State Code,statename,District Code,distname,Year,tch_govt1,tch_govt2,tch_govt3,tch_govt4,tch_govt5,tch_govt6,tch_govt7,tch_govt9,tch_pvt1,tch_pvt2,tch_pvt3,tch_pvt4,tch_pvt5,tch_pvt6,tch_pvt7,tch_pvt9,tch_un1,tch_un2,tch_un3,tch_un4,tch_un5,tch_un6,tch_un7,tch_un9,tch_bs1,tch_bs2,tch_bs3,tch_bs4,tch_bs5,tch_bs6,tch_bs7,tch_bs_p,tch_s1,tch_s2,tch_s3,...,tch_sc_m7,tch_sc_f1,tch_sc_f2,tch_sc_f3,tch_sc_f4,tch_sc_f5,tch_sc_f6,tch_sc_f7,tch_st_m1,tch_st_m2,tch_st_m3,tch_st_m4,tch_st_m5,tch_st_m6,tch_st_m7,tch_st_f1,tch_st_f2,tch_st_f3,tch_st_f4,tch_st_f5,tch_st_f6,tch_st_f7,trn_tch_m1,trn_tch_m2,trn_tch_m3,trn_tch_m4,trn_tch_m5,trn_tch_m6,trn_tch_m7,trn_tch_f1,trn_tch_f2,trn_tch_f3,trn_tch_f4,trn_tch_f5,trn_tch_f6,trn_tch_f7,prof_trn_tch_r,prof_trn_tch_p,days_nontch,tch_nontch
0,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,2012-13,329,429,1097,0,127,432,0,0,308,117,317,0,0,83,0,0,0,0,0,0,0,0,0,0,15,9,13,0,0,4,0,11,9,10,25,...,0,0,0,1,0,0,0,0,4,3,16,0,0,8,0,11,5,18,0,1,11,0,69,97,64,0,14,66,0,134,176,135,0,22,103,0,2968,228,12,519
1,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,2012-13,305,285,194,95,268,175,0,0,103,31,0,0,0,15,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,3,7,4,3,...,0,1,1,0,0,0,0,0,0,1,6,2,7,0,0,2,2,3,0,4,2,0,126,79,32,8,45,37,0,84,85,40,3,28,60,0,1249,203,8,362
2,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,2012-13,110,95,56,0,135,114,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,26,7,8,...,0,0,0,0,0,0,0,0,33,29,18,0,27,27,0,53,32,16,0,28,40,0,36,25,18,0,22,31,0,58,29,23,0,17,46,0,430,78,20,28
3,28,ANDHRA PRADESH ...,2801,ADILABAD ...,2012-13,4749,1788,38,0,22,939,4309,0,2004,2298,0,0,0,327,2222,0,214,82,0,0,0,5,5,0,0,0,0,0,0,0,0,0,798,152,0,...,621,401,252,4,0,0,37,294,1447,394,1,0,1,409,260,288,126,1,0,0,187,146,1964,554,0,0,0,46,641,830,267,0,0,0,8,248,16419,845,13,263
4,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,2012-13,5797,2879,209,8,6733,2,52,0,2063,2184,106,0,2307,0,41,0,52,97,0,0,8,0,0,0,0,0,0,0,0,0,0,0,132,80,1,...,1,457,236,12,0,444,1,4,281,143,5,0,210,0,3,136,69,3,0,139,0,0,2521,1161,2,0,1226,0,10,1652,726,0,0,591,0,3,21487,676,14,1185


## Exercise 2  - Data Integration (3 Marks)

As the required data is spread across different datasets, we need to **integrate both into a single dataframe/dataset**.
  * To integrate the datasets, create a unique identifier for each row in both dataframes so that it can be used to match records across the files.
   
    * Combine the year, state code, and district code columns to form a new unique identifier column; refer to this [link](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas).
    * Set the identifier column as the index for each dataframe.

    * Integrate the dataframes using the above index.
     
     Hint: For merging or joining the datasets, refer to this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Example:** Data for the district Anantapur in Andhra Pradesh, which is present in different files, should form a single row after integrating the datasets.
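
A minimal sketch of this integration on hypothetical two-row tables follows. The helper `add_key`, the `key` column name, and the underscore separator are assumptions made for illustration. The column names follow the renamed columns used in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two files (rows in different order).
basic = pd.DataFrame({
    "Year": ["2012-13", "2012-13"],
    "State Code": [28, 35],
    "District Code": [2822, 3501],
    "overall_lit": ["Low", "High"],
})
teachers = pd.DataFrame({
    "Year": ["2012-13", "2012-13"],
    "State Code": [35, 28],
    "District Code": [3501, 2822],
    "tch_nontch": [519, 1185],
})

key_cols = ["Year", "State Code", "District Code"]


def add_key(df):
    """Build a string identifier from the key columns and use it as the index."""
    key = (
        df["Year"].astype(str)
        + "_" + df["State Code"].astype(str)
        + "_" + df["District Code"].astype(str)
    )
    return df.assign(key=key).set_index("key")


basic_k = add_key(basic)
# Drop the key columns from the second table so they are not duplicated.
teachers_k = add_key(teachers).drop(columns=key_cols)

# Join on the shared index; each district ends up in a single row.
merged = basic_k.join(teachers_k, how="inner")
print(merged)
```

A separator between the parts keeps the identifier unambiguous when codes have different lengths, and dropping the key columns from one side before joining avoids duplicate columns in the result.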


In [None]:
# YOUR CODE HERE for integrating the datasets

data1['info'] = data1[['Year', 'State Code', 'District Code']].apply(
    lambda x: ''.join(x.dropna().astype(str)),
    axis=1
)


Unnamed: 0,Year,State Code,statename,District Code,distname,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit,info
0,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3501,ANDAMANS ...,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,High,84.52,2012-13353501
1,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3503,MIDDLE AND NORTH ANDAMANS ...,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,High,79.39,2012-13353503
2,2012-13,35,ANDAMAN & NICOBAR ISLANDS ...,3502,NICOBARS ...,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,High,70.7,2012-13353502
3,2012-13,28,ANDHRA PRADESH ...,2801,ADILABAD ...,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,Low,51.99,2012-13282801
4,2012-13,28,ANDHRA PRADESH ...,2822,ANANTAPUR ...,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,Low,54.31,2012-13282822


In [None]:
data2['info'] = data2[['Year', 'State Code', 'District Code']].apply(
    lambda x: ''.join(x.dropna().astype(str)),
    axis=1
)


In [None]:
data1 = data1.set_index('info')
data2 = data2.set_index('info')





In [None]:
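# Align rows on the shared 'info' index and place the columns side by side.
# Columns present in both dataframes (e.g. Year) appear twice at this point.
frames = [data1, data2]
result = pd.concat(frames, axis=1)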

## Exercise 3 - Data Cleaning (3 Marks)

1.  `overall_lit` is our target variable. Delete rows with a missing `overall_lit` value.

   Hint: Refer to the link [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).


2.  Convert categorical values to numerical values.

  For example, if a feature contains categorical values such as dog, cat, mouse, etc., replace them with 1, 2, 3, etc., or use scikit-learn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

3. Replace the missing values in any other column appropriately with mean / median / mode.

  Hint: Use pandas [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function to replace the missing values
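
The three cleaning steps can be sketched as follows on a hypothetical five-row table. The values are made up, and the choice of the median for imputation is just one of the options listed above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with a missing target and a missing feature value.
df = pd.DataFrame({
    "overall_lit": ["High", "Low", None, "Medium", "Low"],
    "female_lit": [84.5, np.nan, 60.0, 70.7, 52.0],
})

# 1. Drop rows whose target is missing; copy so later assignments are safe.
df = df.dropna(subset=["overall_lit"]).copy()

# 2. Encode the categorical target as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
df["overall_lit"] = encoder.fit_transform(df["overall_lit"])
print(list(encoder.classes_))

# 3. Fill the remaining missing feature values with the column median.
df["female_lit"] = df["female_lit"].fillna(df["female_lit"].median())
print(df)
```

Keeping the fitted `encoder` around makes it possible to map predictions back to the original labels with `encoder.inverse_transform`.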




In [None]:
# YOUR CODE HERE for data cleaning

#result.isnull().sum()
result = result.dropna(subset=['overall_lit'])
result.shape

(1268, 200)

In [None]:
result = result.loc[:,~result.columns.duplicated()]


In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

result['statename'] = le.fit_transform(result['statename'])
result['distname'] = le.fit_transform(result['distname'])
result['overall_lit'] = le.fit_transform(result['overall_lit'])

result.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0_level_0,Year,State Code,statename,District Code,distname,blocks,clusters,villages,totschools,totpopulation,p_06_pop,p_urb_pop,sexratio,sexratio_06,growthrate,p_sc_pop,p_st_pop,overall_lit,female_lit,tch_govt1,tch_govt2,tch_govt3,tch_govt4,tch_govt5,tch_govt6,tch_govt7,tch_govt9,tch_pvt1,tch_pvt2,tch_pvt3,tch_pvt4,tch_pvt5,tch_pvt6,tch_pvt7,tch_pvt9,tch_un1,tch_un2,tch_un3,tch_un4,tch_un5,...,tch_sc_m7,tch_sc_f1,tch_sc_f2,tch_sc_f3,tch_sc_f4,tch_sc_f5,tch_sc_f6,tch_sc_f7,tch_st_m1,tch_st_m2,tch_st_m3,tch_st_m4,tch_st_m5,tch_st_m6,tch_st_m7,tch_st_f1,tch_st_f2,tch_st_f3,tch_st_f4,tch_st_f5,tch_st_f6,tch_st_f7,trn_tch_m1,trn_tch_m2,trn_tch_m3,trn_tch_m4,trn_tch_m5,trn_tch_m6,trn_tch_m7,trn_tch_f1,trn_tch_f2,trn_tch_f3,trn_tch_f4,trn_tch_f5,trn_tch_f6,trn_tch_f7,prof_trn_tch_r,prof_trn_tch_p,days_nontch,tch_nontch
info,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
2012-13353501,2012-13,35,0,3501,21,3,16,83,212,237586.0,23616.05,55.89,874.0,980.0,13.97,0.0,1.72,0,84.52,329,429,1097,0,127,432,0,0,308,117,317,0,0,83,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,4,3,16,0,0,8,0,11,5,18,0,1,11,0,69,97,64,0,14,66,0,134,176,135,0,22,103,0,2968,228,12,519
2012-13353503,2012-13,35,0,3503,382,3,13,76,181,105539.0,11651.51,2.6,925.0,975.0,-0.07,0.0,0.72,0,79.39,305,285,194,95,268,175,0,0,103,31,0,0,0,15,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,1,6,2,7,0,0,2,2,3,0,4,2,0,126,79,32,8,45,37,0,84,85,40,3,28,60,0,1249,203,8,362
2012-13353502,2012-13,35,0,3502,422,3,8,42,58,36819.0,4226.82,0.0,778.0,961.0,-12.48,0.0,64.28,0,70.7,110,95,56,0,135,114,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,33,29,18,0,27,27,0,53,32,16,0,28,40,0,36,25,18,0,22,31,0,58,29,23,0,17,46,0,430,78,20,28
2012-13282801,2012-13,28,1,2801,0,52,356,1576,4983,2737738.0,295675.7,27.68,1003.0,942.0,10.04,17.82,18.09,1,51.99,4749,1788,38,0,22,939,4309,0,2004,2298,0,0,0,327,2222,0,214,82,0,0,0,...,621,401,252,4,0,0,37,294,1447,394,1,0,1,409,260,288,126,1,0,0,187,146,1964,554,0,0,0,46,641,830,267,0,0,0,8,248,16419,845,13,263
2012-13282822,2012-13,28,1,2822,19,63,564,929,5188,4083315.0,427114.75,28.09,977.0,927.0,12.16,14.29,3.78,1,54.31,5797,2879,209,8,6733,2,52,0,2063,2184,106,0,2307,0,41,0,52,97,0,0,8,...,1,457,236,12,0,444,1,4,281,143,5,0,210,0,3,136,69,3,0,139,0,0,2521,1161,2,0,1226,0,10,1652,726,0,0,591,0,3,21487,676,14,1185
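
The `SettingWithCopyWarning` shown above appears when columns are assigned on a DataFrame that pandas regards as a slice of another one. One common way to avoid it, sketched here on hypothetical toy data, is to take an explicit copy right after slicing. The names `df` and `deduped` are illustrative only.

```python
import pandas as pd

# Hypothetical frame with a duplicated column name, as produced by a side-by-side concat.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Keep the first occurrence of each column name and take an explicit copy,
# so that later assignments modify an independent DataFrame.
deduped = df.loc[:, ~df.columns.duplicated()].copy()

# Assigning to a column of the copy does not touch the original frame.
deduped["b"] = deduped["b"] * 10
print(deduped)
```

The encoded values are the same either way. The copy only makes explicit that the new DataFrame is independent of the one it was sliced from.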


In [None]:
result = result.drop(columns='Year')

In [None]:
# Fill missing values in each column with that column's mean
values = {col: result[col].mean() for col in result.columns}
print(values)
result = result.fillna(value=values)

{'State Code': 17.033123028391167, 'statename': 19.644321766561514, 'District Code': 1719.4179810725552, 'distname': 317.66009463722395, 'blocks': 11.115930599369085, 'clusters': 128.6955835962145, 'villages': 912.1175078864353, 'totschools': 2207.024447949527, 'totpopulation': 1899024.1324921136, 'p_06_pop': 251190.74210110572, 'p_urb_pop': 24.819255150554696, 'sexratio': 942.6782334384858, 'sexratio_06': 918.8135860979463, 'growthrate': 17.6278864353312, 'p_sc_pop': 14.830015923566897, 'p_st_pop': 17.625031847133766, 'overall_lit': 0.944794952681388, 'female_lit': 64.61935331230288, 'tch_govt1': 3084.332807570978, 'tch_govt2': 1766.9392744479496, 'tch_govt3': 155.03864353312304, 'tch_govt4': 617.8824921135647, 'tch_govt5': 806.179022082019, 'tch_govt6': 159.47791798107255, 'tch_govt7': 420.55047318611986, 'tch_govt9': 0.5047318611987381, 'tch_pvt1': 898.6317034700315, 'tch_pvt2': 1084.2421135646687, 'tch_pvt3': 830.4266561514196, 'tch_pvt4': 201.5993690851735, 'tch_pvt5': 507.1466876

## Exercise 4 - (3 Marks)

1. Remove the unnecessary columns that do not contribute to the overall literacy rate.

2. Verify if there are any duplicate columns and remove them.

  For example: state name and district name carry the same information as state code and district code.

3. Make sure that the final dataframe has no null or NaN values. Delete the rows with missing values.

   Hint: Use `df.isna()` to check for NaN values in the dataframe.
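
A short sketch of these checks on hypothetical data follows. The one-to-one test between a code column and a name column is an illustrative way to confirm that the two are redundant. It is an assumption, not a method prescribed by the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: statename repeats the information in State Code.
df = pd.DataFrame({
    "State Code": [35, 28, 28],
    "statename": ["ANDAMAN", "ANDHRA", "ANDHRA"],
    "female_lit": [84.5, np.nan, 54.3],
})

# Two columns are redundant if every code maps to exactly one name and vice versa.
pairs = df[["State Code", "statename"]].drop_duplicates()
one_to_one = pairs["State Code"].is_unique and pairs["statename"].is_unique
if one_to_one:
    df = df.drop(columns=["statename"])

# Count the remaining missing values, then drop the rows that contain any.
print(df.isna().sum())
df = df.dropna()
print(df.shape)
```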

In [None]:
# YOUR CODE HERE for cleaning the dataframe
result.isna().sum()

State Code        0
statename         0
District Code     0
distname          0
blocks            0
                 ..
trn_tch_f7        0
prof_trn_tch_r    0
prof_trn_tch_p    0
days_nontch       0
tch_nontch        0
Length: 194, dtype: int64

In [None]:
# statename and distname duplicate the information in State Code and District Code
result = result.drop(columns=['statename', 'distname'])

In [None]:
result.shape

(1268, 192)

## Exercise 5 - Apply Correlation Matrix (2 Marks)

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. A larger number of features does not imply better accuracy. Irrelevant or redundant features add noise to the model and can reduce its accuracy.

*Highly correlated features carry nearly the same information, so we remove them.*

**Function Description:**

`remove_Highly_Correlated()` removes highly correlated features from the dataframe. It:
- Creates a correlation matrix over all pairs of features
- Keeps only one triangle of the matrix, since a correlation matrix is symmetric and has the same values below and above the diagonal (the code masks the upper triangle and the diagonal)
- Removes columns whose correlation with another column exceeds the threshold value.
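
The steps above can be traced on a small synthetic example. The `toy` frame, its column names, and the 0.9 threshold are assumptions chosen for illustration. One column is an exact linear function of another, so the pair has a correlation of about 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2.0 * x + 1.0,        # perfectly correlated with x
    "noise": rng.normal(size=200),    # unrelated to x
})

# Pairwise correlation matrix (symmetric, ones on the diagonal).
corr = toy.corr()

# Hide the upper triangle and the diagonal so each pair appears only once.
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower = corr.mask(mask)

# A column is dropped if it correlates above the threshold with any later column.
to_drop = [col for col in lower.columns if (lower[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```

With this masking, the earlier column of each highly correlated pair is the one that gets dropped. The comparison `> 0.9` only catches strong positive correlation. Comparing `lower.abs()` instead would also catch strong negative correlation.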

In [None]:
def remove_Highly_Correlated(df, bar=0.9):
  # Create the correlation matrix
  corr = df.corr()

  # Mask the upper triangle and the diagonal so each pair is considered once
  mask = np.triu(np.ones_like(corr, dtype=bool))
  tri_df = corr.mask(mask)

  # Find features whose correlation with another feature exceeds the threshold (default bar=0.9)
  highly_cor_col = [col for col in tri_df.columns if any(tri_df[col] > bar)]
  print("length of highly correlated columns", len(highly_cor_col))

  # Drop the highly correlated columns from the dataframe that was passed in
  reduced_df = df.drop(highly_cor_col, axis=1)
  print("shape of total data", df.shape, "shape of reduced data", reduced_df.shape)
  return reduced_df

In [None]:
# YOUR CODE HERE to remove highly correlated features from the dataframe by calling above function.
df = remove_Highly_Correlated(result)

length of highly correlated columns 26
shape of total data (1268, 192) shape of reduced data (1268, 166)


## Exercise 6 - (3 Marks)

Perform mean correction and standard scaling on the data, feature-wise (column-wise).

**Hint:** To understand the terms used above, refer to the following link:

[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
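
Mean correction subtracts each column's mean, and standard scaling divides by the column's standard deviation. The sketch below, on made-up numbers, checks that `StandardScaler` matches this manual computation. It also fits the scaler on the training rows only, which is a common practice assumed here and not a requirement stated by the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

# Learn the per-column mean and standard deviation from the training rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics for unseen rows.
X_test_scaled = scaler.transform(X_test)

# Manual equivalent: (x - mean) / std, using the population standard deviation.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_train_scaled, manual))
```

After scaling, every training column has mean 0 and unit variance, so features measured in large units no longer dominate distance-based models such as k-nearest neighbours.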

In [None]:
# Every column except the target is a feature
features = df.drop(columns='overall_lit')

In [None]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform returns a NumPy array, so the index and column names are lost
scaled_features = scaler.fit_transform(features.values)
# Rebuild a dataframe with the original index and column names
scaled_features_df = pd.DataFrame(scaled_features, index=features.index, columns=features.columns)


## Exercise 7 - (3 Marks)

Apply different classifiers on the preprocessed data and figure out which classifier gives the best result.

* Split the data into train and test

* Fit the model with train data and find the accuracy of test data

### Expected Accuracy is above 90%
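
One way to compare several classifiers on the same split is to loop over them. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed features, so the model choices, sizes, and resulting accuracies are illustrative assumptions and say nothing about the district data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the preprocessed features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training rows and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, best)
```

Using one split and one metric for every model keeps the comparison fair. A single split can still be noisy on a small dataset.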

In [None]:
df.columns[12]

'overall_lit'

In [None]:
# YOUR CODE HERE for applying different classifiers

features = scaled_features_df.values  # all scaled feature columns as X
labels = df['overall_lit'].values  # target column (class) as y



In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(features,labels,test_size=0.2,random_state=42)

In [None]:
X_train.shape, X_test.shape, y_test.shape, y_train.shape 

((1014, 165), (254, 165), (254,), (1014,))

In [None]:
#Decision Tree

from sklearn import tree
from sklearn.metrics import accuracy_score
clf = tree.DecisionTreeClassifier(criterion='entropy',random_state=42)

# Fitting the data

clf = clf.fit(X_train,y_train)

# Calculating the labels for test data
pred = clf.predict(X_test)
accuracy_score(y_test, pred)



0.937007874015748

In [None]:


# K-Nearest Neighbours

from sklearn.neighbors import KNeighborsClassifier

k = 7
# Reuses the train/test split created above
neigh = KNeighborsClassifier(n_neighbors=k)
neigh.fit(X_train, y_train)
prediction = neigh.predict(X_test)
accuracy_score(y_test, prediction)

0.7204724409448819

In [None]:
#Random Forest

from sklearn.ensemble import RandomForestClassifier
clf1 = RandomForestClassifier(criterion='entropy', random_state=42)

# YOUR CODE HERE
clf1.fit(X_train, y_train)
y_pred = clf1.predict(X_test)
accuracy_score(y_test, y_pred)


0.9291338582677166

In [None]:
#Bagging
from sklearn.ensemble import BaggingClassifier

Bag = BaggingClassifier(random_state=42)

# YOUR CODE HERE to fit and predict

Bag.fit(X_train, y_train)
y_pred1 = Bag.predict(X_test)
accuracy_score(y_test, y_pred1)

0.9291338582677166