## AI Algorithm - Exe3

### LOGISTIC REGRESSION WITH PCA 

## Business problem description

This data was extracted by Ronny from the 1994 Census Bureau database. A set of reasonably clean records was selected using the following criteria: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
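
As a rough illustration of the extraction criteria, the same filter can be written as a pandas boolean mask. This is only a sketch: the column names `AAGE`, `AGI`, `AFNLWGT` and `HRSWK` come from the criteria quoted above, while the `raw` DataFrame and its values are made up for the example.

```python
import pandas as pd

# Toy records; the values are invented purely to exercise each condition
raw = pd.DataFrame({
    "AAGE":    [15, 34, 52, 41, 28],
    "AGI":     [500, 80, 30000, 45000, 20000],
    "AFNLWGT": [1200, 900, 1500, 1, 2000],
    "HRSWK":   [10, 40, 0, 38, 45],
})

# Same four conditions as the quoted criteria, combined with logical AND
clean_mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[clean_mask]
print(len(clean), "of", len(raw), "toy records pass the filter")
```

Each of the first four toy rows violates exactly one condition, so only the last row survives.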

### The dataset contains the following features:
#### 1) age : age of the person
#### 2) workclass : working/labour class of the person
#### 3) fnlwgt : final weight. The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the US civilian noninstitutional population.
#### 4) education : education level of the person
#### 5) education.num : education level of the person expressed as a number
#### 6) marital.status : marital status of the person
#### 7) occupation : occupation of the person
#### 8) relationship : relationship status of the person
#### 9) race : race of the person
#### 10) sex : sex of the person
#### 11) capital.gain : profit of the person from stocks and other assets
#### 12) capital.loss : loss of the person from stocks and other assets
#### 13) hours.per.week : number of hours the person works per week
#### 14) native.country : native country of the person
#### 15) income : income of the person (<=50K or >50K)

    

Income is the dependent variable, as we need to predict whether the person makes over 50,000 dollars in a year.
The other fourteen features are the independent features.

## Data Acquisition

##### We start by importing the dataset and acquiring basic info about it

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('C:\\Users\\Shrita\\Downloads\\exe3\\adult.csv')

In [None]:
df.info()

##### We are dealing with 6 integer columns; the other columns hold categorical data, which needs to be encoded later

In [None]:
df.describe()

##  Exploratory data analysis

Let's look at the data

In [None]:
df.head(7)

In [None]:
df.shape

##### The dataset has 32561 rows and 15 columns

In [None]:
df.isnull().sum()

In [None]:

msno.bar(df);

According to this, the dataset has no null values.

In [None]:
df['occupation'].value_counts()

We can see that '?' marks missing values, and many rows contain it (see the count above). Dropping those rows would discard a lot of data, so we will replace the '?' entries instead.

In [None]:
df[df == '?'] = np.nan

In [None]:

msno.bar(df);

In [None]:
df.isnull().sum()

Now we can clearly see the missing values. We will replace them with the mode of the respective column.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['workclass'] = df['workclass'].fillna(df['workclass'].mode()[0])
df['occupation'] = df['occupation'].fillna(df['occupation'].mode()[0])
df['native.country'] = df['native.country'].fillna(df['native.country'].mode()[0])
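
The two cleaning steps (turning '?' into NaN, then filling with the mode) can be checked on a tiny example. This is a minimal sketch on a made-up `toy` DataFrame, not on the census data itself.

```python
import numpy as np
import pandas as pd

# Made-up rows that use '?' as the missing-value marker
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})

# Step 1: make the missing values visible to pandas
toy = toy.replace("?", np.nan)
missing_before = toy.isnull().sum()

# Step 2: fill each column with its most frequent value
for col in toy.columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(missing_before.to_dict())
print(toy.isnull().sum().to_dict())
```

Mode imputation keeps every row but makes the most frequent category even more dominant, which is worth keeping in mind when reading the occupation counts later.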

##### Checking null values again

In [None]:
df.isnull().sum()

In [None]:

msno.bar(df);

##### And we are all clear

## Data analysis of the features

In [None]:

labels = ['Below $50k','Over $50k']
income_counts = df['income'].value_counts()
# Look the counts up by label so that each size matches its label
sizes = [income_counts['<=50K'],
         income_counts['>50K']]


# print(sizes) # adds up to 32561, which is the total number of rows

fig1 = plt.figure(figsize=(20,9))

ax1=fig1.add_subplot(121)
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, colors = ['#003f5c', '#bc5090'])
ax1.set_title('Income Count of the Dataset',fontsize = 20)
ax1.axis('equal')


plt.show()

Out of all the rows examined, a little over 75% of the population has an income of 50,000 dollars per year or less, and about 24% has an income of more than 50,000 dollars.
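
The class shares can also be computed directly rather than read off the pie chart. A small sketch, assuming a made-up `income` Series; taking the labels from the index of `value_counts()` keeps labels and sizes aligned.

```python
import pandas as pd

# Made-up target column with the same two labels as the dataset
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="income")

# normalize=True returns fractions instead of raw counts,
# sorted from the most to the least frequent label
shares = income.value_counts(normalize=True)
labels = shares.index.tolist()
print(shares.round(2).to_dict())
```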

In [None]:

# value_counts() returns the occupations sorted by frequency;
# take labels and sizes from the same Series so they stay aligned
occupation_counts = df['occupation'].value_counts()
labels = occupation_counts.index
sizes = occupation_counts.values


# print(sizes.sum()) # adds up to 32561, which is the total number of rows

fig1 = plt.figure(figsize=(20,12))

ax1=fig1.add_subplot(121)
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, colors = ['y','#bc5090','khaki','tan','g'])
ax1.set_title('Occupation Count of the dataset',fontsize=20)
ax1.axis('equal')


plt.show()

The most common occupations in the dataset are Prof-specialty, Craft-repair and Exec-managerial. The least common occupations are Armed-Forces and Priv-house-serv.

In [None]:
df['occupation'].value_counts()

In [None]:
df['income'].value_counts()

In [None]:
df['capital.gain'].value_counts()

In [None]:

f= plt.figure(figsize=(15,7))

ax1=f.add_subplot()

sns.kdeplot(df.loc[df['income'] == '<=50K', 'education.num'], label='<=50K', fill = True)
sns.kdeplot(df.loc[df['income'] == '>50K', 'education.num'], label='>50K', fill = True)

ax1.set_title('Does studying for more years lead to higher income?',fontsize = 20)
ax1.legend(loc = 1)

This graph shows an interesting pattern. People with an income of 50,000 dollars or less mostly had 9-10 years of education, while people earning more than 50,000 dollars a year had mostly studied for 12-14.5 years.

In [None]:
f= plt.figure(figsize=(15,7))

ax1=f.add_subplot()
sns.kdeplot(df.loc[df['income'] == '<=50K', 'age'], label='<=50K', fill = True)
sns.kdeplot(df.loc[df['income'] == '>50K', 'age'], label='>50K', fill = True)
plt.title('Do older people earn more?', fontsize = 18)
ax1.legend(loc = 1)

This graph also reveals an interesting pattern. People aged 20-28 mostly earned 50,000 dollars per year or less, while people aged 30 to 50 more often had a higher salary (greater than 50,000 dollars a year).

In [None]:
df['sex'].value_counts()

In [None]:
f= plt.figure(figsize=(15,7))

ax1=f.add_subplot()
sns.kdeplot(df.loc[df['sex'] == 'Male', 'hours.per.week'], label='Male')
sns.kdeplot(df.loc[df['sex'] == 'Female', 'hours.per.week'], label='Female')
ax1.set_title('Do men work more than women?',fontsize = 20)
ax1.legend(loc = 1)

This graph shows a roughly bell-shaped distribution. Working 30-50 hours a week is very common, whereas working fewer than 20 hours or more than 60 hours is uncommon.
Men work slightly more hours than women.

In [None]:
f= plt.figure(figsize=(40,9))

ax1=f.add_subplot(121)
sns.countplot(x = df['race'], hue = df['income'])
plt.title('Racial differences in Income', fontsize = 18)
ax1.legend(loc = 1)


The plot suggests that a larger share of White and Asian-Pac-Islander people earn more than 50,000 dollars a year compared to the other races.

In [None]:
f= plt.figure(figsize=(30,9))

ax1=f.add_subplot(121)
sns.scatterplot(data=df, x="capital.gain", y="hours.per.week", s=100, color=".2", marker="+")


People who work 20-60 hours per week have capital gains of up to about 40k and, very rarely, of more than 90k. People who work fewer than 20 hours or more than 80 hours have slightly lower capital gains.

In [None]:
f= plt.figure(figsize=(40,9))

ax1=f.add_subplot(121)
sns.countplot(x = df['workclass'], hue = df['income'])
plt.title('Which workclass will have higher income?', fontsize = 18)
ax1.legend(loc = 1)


The private and self-employed work classes have higher salaries than the other work classes, whereas people who have never worked or who work without pay have very low income.

In [None]:
f= plt.figure(figsize=(40,9))

ax1=f.add_subplot(121)
sns.countplot(x = df['race'], hue = df['occupation'])
plt.title('Which occupation is more common in respective races', fontsize = 18)
ax1.legend(loc = 1)


Prof-specialty is the most common occupation among White and Asian people, whereas Black people most often had jobs in the Other-service category.

In [None]:
f= plt.figure(figsize=(45,9))

ax1=f.add_subplot(121)
sns.countplot(x = df['occupation'], hue = df['income'])
plt.title('Which occupation has higher income?', fontsize = 18)
ax1.legend(loc = 1)


Professional specialty and administrative clerical jobs more often had an income higher than 50k dollars per year. In the sales, cleaners and craft categories, people have much lower salaries.

In [None]:
f= plt.figure(figsize=(29,7))

ax1=f.add_subplot(121)
sns.barplot(x = 'workclass', y = 'age', data = df)
ax1.set(xlabel='workclass', ylabel='age')
ax1.set_title('Workclass vs Age',fontsize = 20)
for p in ax1.patches:
    height = p.get_height()
    ax1.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center") 
    
plt.show()

The Without-pay class has the highest average age, between 45 and 50, and people who have never worked are mostly aged below 20. Also, self-employed people are on average aged between 38 and 41.

In general, the other work classes, such as Local-gov and State-gov, fall in the 20-50 age range.

In [None]:
f= plt.figure(figsize=(30,6))

ax1=f.add_subplot(121)
sns.countplot(df['sex'], hue = df['income'])
plt.title('Are Men paid more than Women?', fontsize = 18)
ax1.legend(loc = 1)


Men on average have higher salary than Women.

In [None]:
f= plt.figure(figsize=(30,15))

ax1=f.add_subplot(221)
sns.countplot(x='marital.status', hue="income", data=df , palette = 'twilight_shifted')
plt.title('Do separated people have higher income?', fontsize = 18)
ax1.legend(loc = 1)



plt.show()


We can see that married people had higher income, while people who were never married have lower salaries. This might be because most people under 20 years of age are unmarried.

In [None]:
# The labels in the data are '<=50K' and '>50K'; any other key would map to NaN
df['income_int'] = df['income'].map({'<=50K' : 0, '>50K' : 1})

In [None]:
f= plt.figure(figsize=(40,6))

ax=f.add_subplot(121)
sns.stripplot(data=df,
         x='education',
         y='workclass',
         jitter=True)
ax.set_title('Does education affect working class?')


We can see from the graph that people who have never worked or who have unpaid jobs have lower levels of education, while the other work classes include people with masters and doctorate degrees.

In [None]:

sns.lmplot(data=df,
           x="hours.per.week",
           y="age", height = 7)
plt.title('Do younger people work more?', fontsize = 18)


People who are older than 20 and younger than 70 work 30-70 hours per week. Older people (70-90), on the other hand, work very few hours per week, which means that younger people work more hours than older people.

## ML Classifiers and datasets (Training and Test)
### I will break down the process into 3 parts and check respective accuracies

--------PART-1-------------

Dropping the column 'income_int' 

In [None]:
df = df.drop(['income_int'], axis=1)

Dividing the dataset into X and y

In [None]:
X = df.drop(['income'], axis=1)
y = df['income']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
X.shape

In [None]:
y.shape

In [None]:
y

In [None]:
type(y)

## Label Encoding 

As most of the columns hold categorical variables, we will convert them into integer values first.


We will also convert the income column separately.

In [None]:
from sklearn import preprocessing

categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical:
        le = preprocessing.LabelEncoder()
        X[feature] = le.fit_transform(X[feature])


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
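
To see which integer each income label receives, the fitted encoder can be inspected. This is a minimal sketch on a made-up list `incomes`; `LabelEncoder` assigns integers in sorted label order and exposes them through `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target values using the two labels found in the dataset
incomes = ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(incomes)

# classes_[i] is the original label that was mapped to integer i
print(list(encoder.classes_))
print(encoded.tolist())
# inverse_transform maps integers back to the original labels
print(list(encoder.inverse_transform([0, 1])))
```

Here '<=50K' becomes 0 and '>50K' becomes 1, so class 1 is the "over 50K" class when reading a confusion matrix.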

In [None]:
y

In [None]:
X

## Feature Scaling

As the columns in the X set have different scales, we will bring them to a common scale.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X  = pd.DataFrame(scaler.fit_transform(X))
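
A common alternative is to split first and fit the scaler on the training rows only, so that no statistics from the test rows reach the model. The sketch below assumes synthetic data (`X_toy`, `y_toy`) with two columns on very different scales; it is an illustration, not the notebook's own pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features: one column around 40, one around 100000
X_toy = rng.normal(loc=[40.0, 100000.0], scale=[10.0, 50000.0], size=(200, 2))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Learn mean and standard deviation from the training rows only
scaler_toy = StandardScaler().fit(X_tr)
X_tr_s = scaler_toy.transform(X_tr)
# Reuse the training statistics on the test rows
X_te_s = scaler_toy.transform(X_te)

print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1; the test columns are only approximately standardized, which is expected.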


## Train-test set split

Splitting the X and y sets into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

##### I want to check the accuracy first without adjusting parameters and without applying PCA

## Applying logistic regression on the training and test sets

Applying Logistic regression on the dataset

In [None]:
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Logistic Regression accuracy score with all the features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
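
The confusion matrix carries more information than the accuracy alone. As a sketch with made-up predictions (`y_true_toy` and `y_pred_toy` are illustrative arrays), accuracy, precision and recall can be derived from its four cells:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for ten people (1 means over 50K)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found

print(cm_toy)
print(accuracy, precision, recall)
```

On an imbalanced target such as income, recall for the smaller class is often much lower than the overall accuracy suggests.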

##### The initial accuracy of the model is 82.17%

## PCA for dimensionality reduction

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
pca.explained_variance_ratio_

Because no n_components value was given, PCA keeps all 14 components.

Together these 14 components explain 100 percent of the variance; the array above shows the share explained by each one.
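
To see how the explained-variance ratios behave when `PCA` is fitted without `n_components`, here is a small sketch on synthetic data. The array `X_demo` and its near-duplicate column are assumptions made for illustration and are not part of the census data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 300 rows, 5 columns; the last column nearly copies the first
X_demo = rng.normal(size=(300, 5))
X_demo[:, 4] = X_demo[:, 0] + 0.05 * rng.normal(size=300)

# With no n_components, PCA keeps min(n_samples, n_features) components
pca_demo = PCA().fit(X_demo)
ratios = pca_demo.explained_variance_ratio_

print(pca_demo.n_components_)   # number of components kept
print(ratios.round(3))          # per-component share, largest first
print(ratios.sum())             # all components together explain everything
```

The redundant column shows up as a final component with a share close to zero, which is the kind of component that can be dropped with little loss.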

In [None]:
X.shape

In [None]:
X

In [None]:
plt.figure(figsize=(8,6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlim(0, 14)  # xlim takes only the left and right limits
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()

12 components explain 90 percent or more of the variance.

In [None]:
X.shape

In [None]:
df1 = pd.DataFrame(data=np.c_[X, y], columns=[f'Feature {i}' for i in range(1, 15)] + ['label'])

In [None]:
df1

In [None]:
df1.label.value_counts()

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df1['label'] = le.fit_transform(df1['label'])

In [None]:
df1

In [None]:
X1 = df1.drop(['label'], axis=1)
y1 = df1['label']

In [None]:
X1

In [None]:
y1.head()

In [None]:
(X_train, X_test, y_train, y_test) = train_test_split(X1, y1, test_size=0.3, random_state=0)

In [None]:
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Logistic Regression accuracy score with all the features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

With all 14 columns retained (PCA kept every component), I got about 82% accuracy.
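
A minimal sketch of how principal components themselves can be fed to logistic regression, assuming a scikit-learn pipeline and synthetic data from `make_classification` (14 features, chosen only to mirror the shape of the X set). The names `X_syn`, `y_syn` and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded dataset: 14 numeric features, binary target
X_syn, y_syn = make_classification(n_samples=1000, n_features=14,
                                   n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Scale, project onto the first 12 principal components, then classify.
# The pipeline fits the scaler and PCA on the training rows only.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=12),
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

acc = accuracy_score(y_te, model.predict(X_te))
n_kept = model.named_steps["pca"].n_components_
print('Accuracy with {} principal components: {:0.4f}'.format(n_kept, acc))
```

Each principal component is a linear combination of all the original columns, so keeping 12 components is not the same operation as dropping two columns from the X set.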

## What happens when I choose only 11 components

### Now, I will check with only 11 components by removing the last three columns of the X set

In [None]:
X = df.drop(['income','native.country', 'hours.per.week', 'capital.loss'], axis=1)
y = df['income']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex']
for feature in categorical:
        le = preprocessing.LabelEncoder()
        X_train[feature] = le.fit_transform(X_train[feature])
        X_test[feature] = le.transform(X_test[feature])


X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Logistic Regression accuracy score with the first 11 features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))
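
When encoders are fitted on the training split and reused on the test split, `LabelEncoder.transform` raises an error for any category that appears only in the test rows. One possible workaround, sketched here on made-up frames `train_toy` and `test_toy`, is `OrdinalEncoder` with an explicit code for unknown categories (available in scikit-learn 0.24 and later).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up splits: the second test value never occurs in the training rows
train_toy = pd.DataFrame({"native.country": ["United-States", "Mexico", "United-States"]})
test_toy = pd.DataFrame({"native.country": ["Mexico", "Holand-Netherlands"]})

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train_toy)
test_codes = enc.transform(test_toy)

print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

`OrdinalEncoder` also encodes several columns in one call, which would replace the loop over the categorical columns.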

### Now, I will check with only 12 components by removing the last two columns of the X set

In [None]:
X = df.drop(['income','native.country', 'hours.per.week'], axis=1)
y = df['income']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex']
for feature in categorical:
        le = preprocessing.LabelEncoder()
        X_train[feature] = le.fit_transform(X_train[feature])
        X_test[feature] = le.transform(X_test[feature])


X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X.columns)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print('Logistic Regression accuracy score with the first 12 features: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
X = df.drop(['income'], axis=1)
y = df['income']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


categorical = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical:
        le = preprocessing.LabelEncoder()
        X_train[feature] = le.fit_transform(X_train[feature])
        X_test[feature] = le.transform(X_test[feature])


In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)


pca= PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
dim = np.argmax(cumsum >= 0.90) + 1
print('The number of dimensions required to preserve 90% of variance is',dim)
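
scikit-learn can also pick the number of components itself when `n_components` is a fraction between 0 and 1. The sketch below compares that shortcut with the manual cumulative-sum approach on synthetic data; `X_syn` and its column scales are assumptions for illustration, and `svd_solver="full"` is set explicitly because the fractional form relies on the full decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic data whose columns have decreasing variance (illustrative only)
X_syn = rng.normal(size=(500, 8)) * np.array([5, 4, 3, 2, 1, 0.5, 0.3, 0.1])

# Manual approach: first index where the cumulative ratio reaches 90%
cumsum_syn = np.cumsum(PCA().fit(X_syn).explained_variance_ratio_)
dim_manual = int(np.argmax(cumsum_syn >= 0.90) + 1)

# Shortcut: ask PCA for enough components to explain 90% of the variance
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_syn)
dim_auto = int(pca_90.n_components_)

print(dim_manual, dim_auto)
```

Both approaches should agree; the fractional form has the advantage that the fitted object can be used directly to transform the data.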

## Conclusion

### If we blindly assume that every feature is correlated with the dependent variable, we get lower accuracy.
### PCA helps not only to reduce the dimension but also to decide how many features to keep. PCA suggested 12 components, and when I kept only 12 columns in the dataset I saw a higher accuracy, because the dropped columns explained only 7% of the variance in total.
### The code above computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 90% of the training set variance.
### However, if we go lower than 12 components, the dimensions no longer explain 90% of the variance. That is why the accuracy for 11 components was much lower.

### As we need higher accuracy from the model, and as suggested by the PCA algorithm, I select 12 components and get a final accuracy of 82.27%.