# Lok Sabha Election 2019: Data Analysis and Prediction using Machine Learning

<img src="https://www.ft.com/__origami/service/image/v2/images/raw/http%3A%2F%2Fcom.ft.imagepublish.upp-prod-us.s3.amazonaws.com%2F5c2322c8-7deb-11e9-81d2-f785092ab560?fit=scale-down&source=next&width=700" width="1000">

## Table of contents

1. Introduction
2. Environment setup
3. Gathering the data
4. Features
5. Exploratory Data Analysis (EDA)
6. Preparing the data
7. Machine learning model experimentation

## 1. Introduction
<br>
<p> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/March_2020_Parliament_Lok_sabha.svg/1200px-March_2020_Parliament_Lok_sabha.svg.png" style="float:left" width="300" height="180"> The Lok Sabha is composed of representatives of the people chosen by direct election on the basis of adult suffrage. The maximum strength of the House envisaged by the Constitution is 552, made up of up to 530 elected members representing the States, up to 20 members representing the Union Territories, and not more than two members of the Anglo-Indian community nominated by the Hon'ble President if, in his/her opinion, that community is not adequately represented in the House. The total elective membership is distributed among the States in such a way that the ratio between the number of seats allotted to each State and the population of the State is, so far as practicable, the same for all States.</p>

## 2. Environment setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotly import tools
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode
%matplotlib inline
init_notebook_mode(connected=True)

## 3. Gathering the data

> The data for this project has been collected from Kaggle:
https://www.kaggle.com/prakrutchauhan/indian-candidates-for-general-election-2019

In [2]:
# Loading the dataset
df = pd.read_csv("data/LS_2.0.csv")

In [3]:
df.head()

Unnamed: 0,STATE,CONSTITUENCY,NAME,WINNER,PARTY,SYMBOL,GENDER,CRIMINAL\nCASES,AGE,CATEGORY,EDUCATION,ASSETS,LIABILITIES,GENERAL\nVOTES,POSTAL\nVOTES,TOTAL\nVOTES,OVER TOTAL ELECTORS \nIN CONSTITUENCY,OVER TOTAL VOTES POLLED \nIN CONSTITUENCY,TOTAL ELECTORS
0,Telangana,ADILABAD,SOYAM BAPU RAO,1,BJP,Lotus,MALE,52.0,52.0,ST,12th Pass,"Rs 30,99,414\n ~ 30 Lacs+","Rs 2,31,450\n ~ 2 Lacs+",376892,482,377374,25.330684,35.468248,1489790
1,Telangana,ADILABAD,Godam Nagesh,0,TRS,Car,MALE,0.0,54.0,ST,Post Graduate,"Rs 1,84,77,888\n ~ 1 Crore+","Rs 8,47,000\n ~ 8 Lacs+",318665,149,318814,21.399929,29.96437,1489790
2,Telangana,ADILABAD,RATHOD RAMESH,0,INC,Hand,MALE,3.0,52.0,ST,12th Pass,"Rs 3,64,91,000\n ~ 3 Crore+","Rs 1,53,00,000\n ~ 1 Crore+",314057,181,314238,21.092771,29.534285,1489790
3,Telangana,ADILABAD,NOTA,0,NOTA,,,,,,,,,13030,6,13036,0.875023,1.225214,1489790
4,Uttar Pradesh,AGRA,Satyapal Singh Baghel,1,BJP,Lotus,MALE,5.0,58.0,SC,Doctorate,"Rs 7,42,74,036\n ~ 7 Crore+","Rs 86,06,522\n ~ 86 Lacs+",644459,2416,646875,33.383823,56.464615,1937690


In [4]:
df.shape

(2263, 19)

In [5]:
df.columns

Index(['STATE', 'CONSTITUENCY', 'NAME', 'WINNER', 'PARTY', 'SYMBOL', 'GENDER',
       'CRIMINAL\nCASES', 'AGE', 'CATEGORY', 'EDUCATION', 'ASSETS',
       'LIABILITIES', 'GENERAL\nVOTES', 'POSTAL\nVOTES', 'TOTAL\nVOTES',
       'OVER TOTAL ELECTORS \nIN CONSTITUENCY',
       'OVER TOTAL VOTES POLLED \nIN CONSTITUENCY', 'TOTAL ELECTORS'],
      dtype='object')

## 4. Features

- 'STATE'
- 'CONSTITUENCY'
- 'NAME'
- 'WINNER'
- 'PARTY'
- 'SYMBOL'
- 'GENDER'
- 'CRIMINAL CASES'
- 'AGE'
- 'CATEGORY'
- 'EDUCATION'
- 'ASSETS'
- 'LIABILITIES'
- 'GENERAL VOTES' 
- 'POSTAL VOTES'
- 'TOTAL VOTES'
- 'OVER TOTAL ELECTORS IN CONSTITUENCY'
- 'OVER TOTAL VOTES POLLED IN CONSTITUENCY'
- 'TOTAL ELECTORS'

## 5. Exploratory Data Analysis (EDA)

In [6]:
# Replacing the column names having \n with ' '
df.columns = df.columns.str.replace('\n',' ')
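
Replacing each `\n` with a single space can leave a doubled space when the raw header already had a space before the line break, as in `'OVER TOTAL ELECTORS \nIN CONSTITUENCY'`. The sketch below shows one possible alternative that collapses any run of whitespace into one space. The tiny frame `demo` is made up for illustration and is not the real dataset.

```python
import pandas as pd

# Hypothetical empty frame that mimics two of the raw headers
demo = pd.DataFrame(columns=['CRIMINAL\nCASES', 'OVER TOTAL ELECTORS \nIN CONSTITUENCY'])

# Plain replacement keeps the space that preceded the line break
plain = demo.columns.str.replace('\n', ' ', regex=False)

# Collapsing every whitespace run gives single-spaced names
collapsed = demo.columns.str.replace(r'\s+', ' ', regex=True).str.strip()

print(list(plain))
print(list(collapsed))
```

Either form works for the rest of the notebook as long as later cells refer to the columns by the names actually produced.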

In [7]:
df.head()

Unnamed: 0,STATE,CONSTITUENCY,NAME,WINNER,PARTY,SYMBOL,GENDER,CRIMINAL CASES,AGE,CATEGORY,EDUCATION,ASSETS,LIABILITIES,GENERAL VOTES,POSTAL VOTES,TOTAL VOTES,OVER TOTAL ELECTORS IN CONSTITUENCY,OVER TOTAL VOTES POLLED IN CONSTITUENCY,TOTAL ELECTORS
0,Telangana,ADILABAD,SOYAM BAPU RAO,1,BJP,Lotus,MALE,52.0,52.0,ST,12th Pass,"Rs 30,99,414\n ~ 30 Lacs+","Rs 2,31,450\n ~ 2 Lacs+",376892,482,377374,25.330684,35.468248,1489790
1,Telangana,ADILABAD,Godam Nagesh,0,TRS,Car,MALE,0.0,54.0,ST,Post Graduate,"Rs 1,84,77,888\n ~ 1 Crore+","Rs 8,47,000\n ~ 8 Lacs+",318665,149,318814,21.399929,29.96437,1489790
2,Telangana,ADILABAD,RATHOD RAMESH,0,INC,Hand,MALE,3.0,52.0,ST,12th Pass,"Rs 3,64,91,000\n ~ 3 Crore+","Rs 1,53,00,000\n ~ 1 Crore+",314057,181,314238,21.092771,29.534285,1489790
3,Telangana,ADILABAD,NOTA,0,NOTA,,,,,,,,,13030,6,13036,0.875023,1.225214,1489790
4,Uttar Pradesh,AGRA,Satyapal Singh Baghel,1,BJP,Lotus,MALE,5.0,58.0,SC,Doctorate,"Rs 7,42,74,036\n ~ 7 Crore+","Rs 86,06,522\n ~ 86 Lacs+",644459,2416,646875,33.383823,56.464615,1937690


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2263 entries, 0 to 2262
Data columns (total 19 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   STATE                                     2263 non-null   object 
 1   CONSTITUENCY                              2263 non-null   object 
 2   NAME                                      2263 non-null   object 
 3   WINNER                                    2263 non-null   int64  
 4   PARTY                                     2263 non-null   object 
 5   SYMBOL                                    2018 non-null   object 
 6   GENDER                                    2018 non-null   object 
 7   CRIMINAL CASES                            2018 non-null   object 
 8   AGE                                       2018 non-null   float64
 9   CATEGORY                                  2018 non-null   object 
 10  EDUCATION                           

In [9]:
df.describe()

Unnamed: 0,WINNER,AGE,GENERAL VOTES,POSTAL VOTES,TOTAL VOTES,OVER TOTAL ELECTORS IN CONSTITUENCY,OVER TOTAL VOTES POLLED IN CONSTITUENCY,TOTAL ELECTORS
count,2263.0,2018.0,2263.0,2263.0,2263.0,2263.0,2263.0,2263.0
mean,0.238179,52.273538,261599.1,990.710561,262589.8,15.811412,23.190525,1658016.0
std,0.426064,11.869373,254990.6,1602.839174,255982.2,14.962861,21.564758,314518.7
min,0.0,25.0,1339.0,0.0,1342.0,0.097941,1.000039,55189.0
25%,0.0,43.25,21034.5,57.0,21162.5,1.296518,1.899502,1530014.0
50%,0.0,52.0,153934.0,316.0,154489.0,10.510553,16.221721,1679030.0
75%,0.0,61.0,485804.0,1385.0,487231.5,29.468185,42.590233,1816857.0
max,1.0,86.0,1066824.0,19367.0,1068569.0,51.951012,74.411856,3150313.0


In [10]:
# Checking whether the dataset contains any null values. NOTA rows are excluded from the check because they carry no candidate details.
df_NOTA = df[df['PARTY']!= 'NOTA']
df_NOTA.isna().sum()

STATE                                       0
CONSTITUENCY                                0
NAME                                        0
WINNER                                      0
PARTY                                       0
SYMBOL                                      0
GENDER                                      0
CRIMINAL CASES                              0
AGE                                         0
CATEGORY                                    0
EDUCATION                                   0
ASSETS                                      0
LIABILITIES                                 0
GENERAL VOTES                               0
POSTAL VOTES                                0
TOTAL VOTES                                 0
OVER TOTAL ELECTORS  IN CONSTITUENCY        0
OVER TOTAL VOTES POLLED  IN CONSTITUENCY    0
TOTAL ELECTORS                              0
dtype: int64
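
To see where the missing values sit, nulls can be counted separately for NOTA and candidate rows. The following is a minimal sketch on made-up rows; the frame `demo` and the group labels are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows: two candidates and two NOTA entries without candidate details
demo = pd.DataFrame({
    'PARTY': ['BJP', 'INC', 'NOTA', 'NOTA'],
    'GENDER': ['MALE', 'FEMALE', None, None],
    'AGE': [52.0, 47.0, None, None],
    'TOTAL VOTES': [377374, 314238, 13036, 9000],
})

# Label every row as either 'NOTA' or 'candidate'
group = demo['PARTY'].eq('NOTA').map({True: 'NOTA', False: 'candidate'})

# Count nulls per column within each group
nulls_by_group = demo.drop(columns='PARTY').isna().groupby(group).sum()
print(nulls_by_group)
```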

In [11]:
df.shape

(2263, 19)

In [12]:
# Dropping irrelevant columns from the dataset
df.drop(['SYMBOL', 'ASSETS', 'LIABILITIES'], axis=1, inplace=True)
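
The `ASSETS` and `LIABILITIES` columns are dropped here. If they were wanted as numeric features instead, strings such as `'Rs 30,99,414\n ~ 30 Lacs+'` would first need parsing. The helper `parse_rupees` below is a hypothetical sketch and not part of the original notebook; it assumes the amount always follows an `Rs` prefix and returns NaN for anything else.

```python
import math
import re

def parse_rupees(text):
    # Return the rupee amount as a float, or NaN when no amount is found
    if not isinstance(text, str):
        return math.nan
    match = re.search(r'Rs\s*([\d,]+)', text)
    if match is None:
        return math.nan
    # Drop the Indian-style digit grouping commas before converting
    return float(match.group(1).replace(',', ''))

print(parse_rupees('Rs 30,99,414\n ~ 30 Lacs+'))
print(parse_rupees('Nil'))
```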

In [13]:
df.shape

(2263, 16)

In [14]:
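# NOTA rows are not candidates, so they are excluded before modelling.
# .copy() makes df an independent frame, so later column assignments do not trigger SettingWithCopyWarning.
df = df[df['PARTY'] != 'NOTA'].copy()
df.shape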

(2018, 16)

In [15]:
df['EDUCATION'].unique()

array(['12th Pass', 'Post Graduate', 'Doctorate', 'Graduate', 'Others',
       '10th Pass', '8th Pass', 'Graduate Professional', 'Literate',
       'Illiterate', '5th Pass', 'Not Available', 'Post Graduate\n'],
      dtype=object)

In [16]:
# Removing the trailing \n from 'Post Graduate\n' (assigned back instead of using a chained inplace call)
df['EDUCATION'] = df['EDUCATION'].replace(to_replace='Post Graduate\n', value='Post Graduate')

In [17]:
# 'Graduate Professional' candidates are graduates, so they are merged into 'Graduate'
df['EDUCATION'] = df['EDUCATION'].replace(to_replace='Graduate Professional', value='Graduate')
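
The two replacements above can also be expressed as one pass: strip stray whitespace, then map merged labels through a dictionary. This is a sketch on a made-up Series; the name `merge_map` is an assumption for illustration.

```python
import pandas as pd

# Made-up sample of raw education labels
education = pd.Series(['Post Graduate\n', 'Graduate Professional', 'Doctorate', '12th Pass'])

# Labels to fold into a broader category
merge_map = {'Graduate Professional': 'Graduate'}

# Strip surrounding whitespace first, then apply the mapping
cleaned = education.str.strip().replace(merge_map)
print(cleaned.tolist())
```

Stripping whitespace also catches any other label with a stray trailing newline, not only `'Post Graduate\n'`.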

In [18]:
df['EDUCATION'].unique()

array(['12th Pass', 'Post Graduate', 'Doctorate', 'Graduate', 'Others',
       '10th Pass', '8th Pass', 'Literate', 'Illiterate', '5th Pass',
       'Not Available'], dtype=object)

In [19]:
df['AGE'] = df['AGE'].astype(int, errors='raise')

In [20]:
df.dtypes

STATE                                        object
CONSTITUENCY                                 object
NAME                                         object
WINNER                                        int64
PARTY                                        object
GENDER                                       object
CRIMINAL CASES                               object
AGE                                           int32
CATEGORY                                     object
EDUCATION                                    object
GENERAL VOTES                                 int64
POSTAL VOTES                                  int64
TOTAL VOTES                                   int64
OVER TOTAL ELECTORS  IN CONSTITUENCY        float64
OVER TOTAL VOTES POLLED  IN CONSTITUENCY    float64
TOTAL ELECTORS                                int64
dtype: object

In [21]:
# Treat 'Not Available' and missing entries as zero cases, then convert to integers
df['CRIMINAL CASES'] = df['CRIMINAL CASES'].replace(to_replace='Not Available', value=0)
df['CRIMINAL CASES'] = df['CRIMINAL CASES'].fillna(0)
df['CRIMINAL CASES'] = df['CRIMINAL CASES'].astype(int, errors='raise')
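
An alternative sketch for the same conversion uses `pd.to_numeric` with `errors='coerce'`, which turns any non-numeric entry (not only `'Not Available'`) into NaN before the fill. The sample values are made up, and filling with 0 simply mirrors the choice made above.

```python
import pandas as pd

# Made-up sample of raw values as they might appear in an object column
cases = pd.Series(['52', '0', 'Not Available', None, '3'])

# Non-numeric entries become NaN, are filled with 0, then cast to int
cases_clean = pd.to_numeric(cases, errors='coerce').fillna(0).astype(int)
print(cases_clean.tolist())
```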

In [22]:
df.dtypes

STATE                                        object
CONSTITUENCY                                 object
NAME                                         object
WINNER                                        int64
PARTY                                        object
GENDER                                       object
CRIMINAL CASES                                int32
AGE                                           int32
CATEGORY                                     object
EDUCATION                                    object
GENERAL VOTES                                 int64
POSTAL VOTES                                  int64
TOTAL VOTES                                   int64
OVER TOTAL ELECTORS  IN CONSTITUENCY        float64
OVER TOTAL VOTES POLLED  IN CONSTITUENCY    float64
TOTAL ELECTORS                                int64
dtype: object

### 5.1 Lok Sabha 2019: Election Results

In [23]:
# Number of seats won by each party
result = df[df['WINNER'] == 1].groupby('PARTY')['WINNER'].size()
result_df = pd.DataFrame(data=result).sort_values(by="WINNER", ascending=False)
result_df.reset_index(level=0, inplace=True)
result_df

Unnamed: 0,PARTY,WINNER
0,BJP,300
1,INC,52
2,DMK,23
3,AITC,22
4,YSRCP,22
5,SHS,18
6,JD(U),16
7,BJD,11
8,BSP,11
9,TRS,9


In [24]:
# Visualize the party-wise election results
result_fig = px.bar(result_df, x='PARTY', y='WINNER', color='WINNER', height=500)
result_fig.show()

### 5.2 Party-wise Vote Share

In [25]:
# Create a dataframe with the total number of votes received by each party
vote_Share = df.groupby('PARTY')['TOTAL VOTES'].sum()
vote_Share_df = pd.DataFrame(data=vote_Share, index=vote_Share.index).sort_values(by="TOTAL VOTES", ascending=False)
vote_Share_df.reset_index(level=0, inplace=True)
vote_Share_df

Unnamed: 0,PARTY,TOTAL VOTES
0,BJP,228938556
1,INC,119418722
2,AITC,24832104
3,BSP,20808194
4,SP,15616282
...,...,...
127,AKBMP,10127
128,ABSKP,9912
129,BBMP,9894
130,BARESP,9565


In [26]:
# Label each party as 'Other' unless it is among the top five by total votes.
vote_share_top5 = df.groupby('PARTY')['TOTAL VOTES'].sum().nlargest(5).index.tolist()
def sort_party(data):
    if data['PARTY'] not in vote_share_top5:
        return 'Other'
    else:
        return data['PARTY']
df['Party New'] = df.apply(sort_party, axis=1)
vote_count = df.groupby('Party New')['TOTAL VOTES'].sum()
vote_count_fig = go.Figure(go.Pie(labels=vote_count.index, 
                                  values=vote_count.values, 
                                  marker=dict(line=dict(color="#000000", 
                                                        width=1))))
vote_count_fig.update_layout(title_text='Party-wise Vote Share')
vote_count_fig.show()
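
The same relabelling can be done without a row-wise `apply`, using `isin` together with `where`. The sketch below runs on a made-up frame and keeps the top two parties rather than five; the names `votes` and `top2` are assumptions for illustration.

```python
import pandas as pd

# Made-up vote counts for four parties
votes = pd.DataFrame({
    'PARTY': ['A', 'B', 'C', 'A', 'D'],
    'TOTAL VOTES': [100, 80, 5, 50, 1],
})

# Parties with the two largest vote totals
top2 = votes.groupby('PARTY')['TOTAL VOTES'].sum().nlargest(2).index

# Keep the party name where it is in the top group, otherwise use 'Other'
votes['Party New'] = votes['PARTY'].where(votes['PARTY'].isin(top2), 'Other')

share = votes.groupby('Party New')['TOTAL VOTES'].sum()
print(share)
```

The vectorised form avoids calling a Python function once per row, which is mainly a readability and speed convenience on larger frames.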

### 5.3 Candidate Age Distribution

In [27]:
# Distribution of age
fig = px.histogram(df, x="AGE")
fig.show()

## 6. Preparing the data

### 6.1 Resampling the dataset

In [28]:
# Creating a dataframe of total counts of the winners and losers
total_winner = df[df['WINNER'] == 1]
total_loser = df[df['WINNER'] == 0]
total_results = df['WINNER'].value_counts().reset_index()
total_results.columns = ['Result', 'Total']
total_results

Unnamed: 0,Result,Total
0,0,1479
1,1,539


In [29]:
# Visualize the counts of total winners and losers
total_results_fig = px.bar(total_results, x=["Loser", "Winner"], y=total_results['Total'], color='Total')
total_results_fig.show()

In [30]:
from sklearn.utils import resample
print(len(total_winner), len(total_loser))

539 1479


In [31]:
# Upsampling the winners (sampling with replacement) to match the number of losers
df_winner_upsample = resample(total_winner, replace=True, n_samples = 1479, random_state=42)
print(len(df_winner_upsample), len(total_loser))

1479 1479


In [32]:
upsampled_dataset = pd.concat([total_loser, df_winner_upsample])
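
Because the winners are upsampled with replacement before any train/test split, copies of the same row can land on both sides of a later split, which tends to inflate test accuracy. A sketch of one common alternative is to split first and upsample only the training part. The data below is made up, and the variable names are assumptions rather than part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 20 losers (0) and 5 winners (1)
demo = pd.DataFrame({'feature': range(25), 'WINNER': [0] * 20 + [1] * 5})

# Split first, keeping the class ratio in both parts
train, test = train_test_split(
    demo, test_size=0.2, stratify=demo['WINNER'], random_state=42
)

# Upsample winners in the training part only
train_win = train[train['WINNER'] == 1]
train_lose = train[train['WINNER'] == 0]
train_win_up = resample(
    train_win, replace=True, n_samples=len(train_lose), random_state=42
)
train_balanced = pd.concat([train_lose, train_win_up])

# Rows that appear in both the balanced training set and the test set
overlap = set(train_balanced.index) & set(test.index)
print(len(train_balanced), len(test), len(overlap))
```

With this ordering the test set keeps the original class ratio, so its accuracy is measured on rows the model has never seen in any form.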

### 6.2 Scaling the data

In [33]:
upsampled_dataset.columns

Index(['STATE', 'CONSTITUENCY', 'NAME', 'WINNER', 'PARTY', 'GENDER',
       'CRIMINAL CASES', 'AGE', 'CATEGORY', 'EDUCATION', 'GENERAL VOTES',
       'POSTAL VOTES', 'TOTAL VOTES', 'OVER TOTAL ELECTORS  IN CONSTITUENCY',
       'OVER TOTAL VOTES POLLED  IN CONSTITUENCY', 'TOTAL ELECTORS',
       'Party New'],
      dtype='object')

In [34]:
# This is the dataset which we will use for Machine Learning
prediction_df = upsampled_dataset.drop(['Party New', 'NAME'], axis=1)

In [35]:
prediction_df.head()

Unnamed: 0,STATE,CONSTITUENCY,WINNER,PARTY,GENDER,CRIMINAL CASES,AGE,CATEGORY,EDUCATION,GENERAL VOTES,POSTAL VOTES,TOTAL VOTES,OVER TOTAL ELECTORS IN CONSTITUENCY,OVER TOTAL VOTES POLLED IN CONSTITUENCY,TOTAL ELECTORS
1,Telangana,ADILABAD,0,TRS,MALE,0,54,ST,Post Graduate,318665,149,318814,21.399929,29.96437,1489790
2,Telangana,ADILABAD,0,INC,MALE,3,52,ST,12th Pass,314057,181,314238,21.092771,29.534285,1489790
5,Uttar Pradesh,AGRA,0,BSP,MALE,0,47,SC,Post Graduate,434199,1130,435329,22.46639,37.999125,1937690
6,Uttar Pradesh,AGRA,0,INC,FEMALE,0,54,SC,Post Graduate,44877,272,45149,2.330042,3.940979,1937690
8,Maharashtra,AHMADNAGAR,0,NCP,MALE,1,34,GENERAL,Graduate,419364,3822,423186,22.734872,35.087431,1861396


In [40]:
from sklearn.preprocessing import StandardScaler
categorical_features = ['STATE', 'CONSTITUENCY', 'PARTY', 'GENDER', 'CATEGORY', 'EDUCATION']
numerical_features = ['CRIMINAL CASES', 'AGE', 'TOTAL VOTES', 'TOTAL ELECTORS']
standardScaler = StandardScaler()
# One-hot encode the categorical columns
prediction_df1 = pd.get_dummies(prediction_df, columns=categorical_features)

In [41]:
prediction_df1.head()

Unnamed: 0,WINNER,CRIMINAL CASES,AGE,GENERAL VOTES,POSTAL VOTES,TOTAL VOTES,OVER TOTAL ELECTORS IN CONSTITUENCY,OVER TOTAL VOTES POLLED IN CONSTITUENCY,TOTAL ELECTORS,STATE_Andaman & Nicobar Islands,...,EDUCATION_12th Pass,EDUCATION_5th Pass,EDUCATION_8th Pass,EDUCATION_Doctorate,EDUCATION_Graduate,EDUCATION_Illiterate,EDUCATION_Literate,EDUCATION_Not Available,EDUCATION_Others,EDUCATION_Post Graduate
1,0,0,54,318665,149,318814,21.399929,29.96437,1489790,0,...,0,0,0,0,0,0,0,0,0,1
2,0,3,52,314057,181,314238,21.092771,29.534285,1489790,0,...,1,0,0,0,0,0,0,0,0,0
5,0,0,47,434199,1130,435329,22.46639,37.999125,1937690,0,...,0,0,0,0,0,0,0,0,0,1
6,0,0,54,44877,272,45149,2.330042,3.940979,1937690,0,...,0,0,0,0,0,0,0,0,0,1
8,0,1,34,419364,3822,423186,22.734872,35.087431,1861396,0,...,0,0,0,0,1,0,0,0,0,0


In [44]:
prediction_df1[numerical_features] = standardScaler.fit_transform(prediction_df1[numerical_features])
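
What `StandardScaler` does to each listed column can be checked on a toy frame: it subtracts the column mean and divides by the population standard deviation. Columns left out of `numerical_features` (such as `GENERAL VOTES` above) keep their raw scale. The frame `demo` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frame with one column to scale and one left untouched
demo = pd.DataFrame({'AGE': [30, 40, 50, 60], 'GENERAL VOTES': [1000, 2000, 3000, 4000]})

# Work on a float copy so the scaled values fit the column dtype
scaled = demo.astype(float)
scaled[['AGE']] = StandardScaler().fit_transform(demo[['AGE']])

# The same result computed by hand: (x - mean) / population std
manual = (demo['AGE'] - demo['AGE'].mean()) / demo['AGE'].std(ddof=0)

print(scaled)
print(np.allclose(scaled['AGE'], manual))
```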

In [45]:
prediction_df1.head()

Unnamed: 0,WINNER,CRIMINAL CASES,AGE,GENERAL VOTES,POSTAL VOTES,TOTAL VOTES,OVER TOTAL ELECTORS IN CONSTITUENCY,OVER TOTAL VOTES POLLED IN CONSTITUENCY,TOTAL ELECTORS,STATE_Andaman & Nicobar Islands,...,EDUCATION_12th Pass,EDUCATION_5th Pass,EDUCATION_8th Pass,EDUCATION_Doctorate,EDUCATION_Graduate,EDUCATION_Illiterate,EDUCATION_Literate,EDUCATION_Not Available,EDUCATION_Others,EDUCATION_Post Graduate
1,0,-0.186473,0.076764,318665,149,-0.257567,21.399929,29.96437,-0.541263,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0.133659,-0.094673,314057,181,-0.274618,21.092771,29.534285,-0.541263,0,...,1,0,0,0,0,0,0,0,0,0
5,0,-0.186473,-0.523267,434199,1130,0.176567,22.46639,37.999125,0.868686,0,...,0,0,0,0,0,0,0,0,0,1
6,0,-0.186473,0.076764,44877,272,-1.277242,2.330042,3.940979,0.868686,0,...,0,0,0,0,0,0,0,0,0,1
8,0,-0.079762,-1.637611,419364,3822,0.131322,22.734872,35.087431,0.628519,0,...,0,0,0,0,1,0,0,0,0,0


## 7. Machine learning model experimentation

In [46]:
# Splitting the data into X and y
X = prediction_df1.drop('WINNER', axis=1)
y = prediction_df1['WINNER']

# Splitting the data into X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [47]:
# Importing the machine learning model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

# Fitting the model on the training data
classifier.fit(X_train, y_train)

RandomForestClassifier()

In [48]:
# Scoring the machine learning model
classifier.score(X_test, y_test)

0.9797297297297297

In [55]:
# Training a support vector classifier on the same split
from sklearn.svm import SVC
classifier = SVC()
classifier.fit(X_train, y_train)

SVC()

In [56]:
classifier.score(X_test, y_test)

0.875

In [57]:
# Training a logistic regression classifier on the same split
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression()

In [58]:
classifier.score(X_test, y_test)

0.518581081081081
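
One plausible reason for the gap between the models, not verified here, is that several vote-count columns were left unscaled, which can hinder the solver used by `LogisticRegression` while tree-based models are unaffected by feature scale. The sketch below shows how scaling and the classifier could be bundled in a pipeline and scored with cross-validation. It runs on synthetic data from `make_classification`, so its score says nothing about the election dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the election features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Inflate one column to mimic a raw, unscaled vote count
X_demo[:, 0] = X_demo[:, 0] * 1e6

# Scaling inside the pipeline is refit on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean())
```

Putting the scaler inside the pipeline means it only ever sees the training fold, and averaging over five folds gives a steadier estimate than a single unseeded `train_test_split`.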