## Segment 2 

## Import required libraries

In [None]:
import pandas as pd
import os
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from collections import Counter
%matplotlib inline
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import seaborn as sns

## Import Data

In [None]:
file_path = "../Data/Speed Dating Data.csv"
df = pd.read_csv(file_path, encoding="ISO-8859-1")
df.columns[0]

In [None]:
for col in df.columns: 
    print(col)

### Machine Learning: Supervised
Using data such as age, education level, race, religion, zip code, income, goal, and date frequency, along with participants' survey evaluations, we will try to predict whether two people could match in a speed dating format.

### Known Category
"Match" vs. "Not Match"

### Data Exploration
Visualize and explore the match, gender, and age distributions in the data.

This data frame is then used for further analysis and model building. The data consists of about 8000 rows and 195 columns.


In [None]:
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df['gender'].value_counts().plot(kind='bar')
plt.xticks(rotation='horizontal')
plt.ylabel('Count')
plt.xlabel("Male = 1, Female = 0")
plt.title('Gender Distribution', fontsize = 16)
plt.show()

The genders are roughly equally distributed, so no rebalancing by gender is needed.

Next we look at matches by gender. The speed dating data contains more non-matches than matches.

In [None]:
# Visualize the matches based on gender
sns.countplot(x="match", hue="gender", data = df)

In [None]:
# age distribution of participants
age = df[np.isfinite(df['age'])]['age']
plt.hist(age.values)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution', fontsize = 16)

Most participants are in their mid-twenties to early thirties.

In [None]:
# Visualize data to see how many decisions were yes or no. 
# 0 = no  and 1 = yes 
sns.countplot(x='dec', data=df)

We see there are more "0" decisions than "1" decisions, where "0" means no and "1" means yes.

This makes sense because the previous chart showed more "0" (no) matches than "1" (yes) matches.

The dec variable is the participant's decision on whether or not they would like a match with their paired partner. Each decision is entered together with the participant's individual scores for the partner on the following 6 attributes: attr, sinc, intel, fun, amb, shar.



### Data Preprocessing
#### Data Cleaning 
We simplified the dataset by dropping unwanted columns and creating a new dataframe with only the features we need to answer the following question or analyse further:

"Can a machine learning model help in predicting and/or improving the speed dating process?"

To explore whether this is true, we decided which data determines whether or not a participant is viewed as a match.

We will use this data as our input to predict the output of Match or No Match.

Note: from the data, the dec attribute is the decision a participant makes after scoring the partner on 6 attributes: attractiveness, sincerity, intelligence, fun, ambitiousness, and shared interests. The like variable is an overall rating, and dec directly records whether the participant wants a match or not.

Let's create a simpler dataset with just the data we need for this analysis.

In [None]:
# Simplify the dataset; .copy() gives an independent dataframe rather than a view of df
date_df = df[["gender", "age","income","race", "career","dec","attr","sinc", "intel","fun","amb","shar","like", "match"]].copy()

# Show all columns when displaying the new dataframe
pd.set_option('display.max_columns', None)
date_df.head()


In [None]:
#Looking at how dec influences a match
sns.boxplot(x="dec", y='match', data=date_df)

Analysing the data visually to show how data is scattered. 

In [None]:
# Set the figure size first so it applies to this plot, then draw the boxplot
sns.set(rc={'figure.figsize':(10,10)})
sns.boxplot(data = date_df, width=0.5, fliersize = 5)

In [None]:
date_df.shape

Data cleaning continues by checking how many null values are in each column.

In [None]:
date_df.isnull().sum()

Several columns have missing values, but the income attribute has 4099 missing data points, roughly half of the entire dataset, so we will drop the income column. With so little of it populated, dropping it does not affect the integrity of the rest of the dataset. We can handle the other columns by dropping the rows that contain missing values.
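
The rule of thumb above (drop a column that is about half empty, then drop the remaining incomplete rows) can be sketched on a toy frame. The data and the `MAX_MISSING` cut-off are made up for illustration and are not taken from the speed dating data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for date_df (values are invented for illustration)
toy = pd.DataFrame({
    "age":    [21, 24, np.nan, 30, 27, 25],
    "income": [np.nan, 55000, np.nan, np.nan, 42000, np.nan],
    "attr":   [6, 7, 8, np.nan, 5, 9],
})

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isnull().mean()

# Hypothetical cut-off: drop columns where half or more of the values are missing
MAX_MISSING = 0.5
too_sparse = missing_share[missing_share >= MAX_MISSING].index

# Drop the sparse columns first, then the rows that still contain nulls
cleaned = toy.drop(columns=too_sparse).dropna()

print(missing_share.round(2))
print(cleaned.shape)
```

Dropping the sparse column before calling `dropna()` matters: doing it the other way round would throw away every row that is only missing the sparse value.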

In [None]:
date_df = date_df.drop('income',axis=1)
date_df

Data Cleaning continued by checking datatypes, null values and unique values.

In [None]:
date_df.dtypes

In [None]:
# Drop the career attribute as it is an object and not needed to predict what is scored
date_df = date_df.drop("career", axis=1)

In [None]:
# Drop all rows with null values (the defaults are axis=0, how='any')
date_df = date_df.dropna()
date_df.isnull().sum()

We now have a fully numerical dataframe with no missing values, and we are ready to proceed with the classification models.

In [None]:
date_df.nunique()

In [None]:
date_df.shape

We see that our datatypes are all numerical and we see the unique values for each attribute. 

gender has 2 unique values, 1 (male) and 0 (female) 

age has 22 different values 

dec has 2 unique values, 1 (yes wants to be matched) and 0 (no does not want to be matched) 

The following have a limited set of unique values because each is rated on a scale of 1-10:

"attr" (attractive),
"sinc" (sincere),
"intel" (intelligent),
"fun",
"amb" (ambitious),
"shar" (shared interests)

like is the overall rating given to the partner

match has 2 unique values, 0 (No match) and 1 (Match)

### Visualize Correlation
To show the correlation between pairs of variables, we looked at the 6 characteristics.

In [None]:
# Find the correlation among the attributes that are the deciding factors in whether there is a match
corr = date_df.corr().round(3)
print(corr)

In [None]:
import plotly.express as px
fig = px.imshow(corr)
fig.show()

The correlation coefficients along the diagonal of the table are all equal to 1 because each variable is perfectly correlated with itself. Some of the other coefficients are fairly high, but none exceeds 0.75. The table also shows that age and race have no real correlation with the other attributes, so we will not use them as features: age is dropped here and race is excluded when the features are defined. We also see that dec is highly correlated with match, because dec is the participant's decision, based on the 6 characteristics, on whether they want to be matched. Note that both participants have to decide yes for there to be a match.
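
A correlation table gets hard to read as it grows, so one option is to list each pair of variables once, ranked by absolute correlation. The sketch below uses synthetic columns that only borrow the notebook's column names. `fun` is deliberately built to track `attr`, while `age` is independent noise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the speed dating data)
rng = np.random.default_rng(0)
n = 200
attr = rng.integers(1, 11, size=n)
fun = np.clip(attr + rng.integers(-2, 3, size=n), 1, 10)  # built to track attr
age = rng.integers(20, 40, size=n)                         # independent of attr
toy = pd.DataFrame({"attr": attr, "fun": fun, "age": age})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by the absolute size of the correlation
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs.round(3))
```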

In [None]:
# Drop Age
date_df = date_df.drop("age", axis=1)


Summary Statistics of both female and male data

In [None]:
# Separate the females and view a summary of their ratings for decision making
females = date_df.loc[date_df["gender"]==0,['gender','dec','attr','race',
                'shar','fun','amb','sinc','intel']]
print(females.describe())

From the summary of the female ratings we can see that the mean of "dec" is .37. This means that female participants decided they wanted to be matched in only 37% of their dates. The attr, shar, fun, amb, sinc, and intel columns hold the ratings given to partners, on a scale of 1-10 with 1 being awful and 10 being great, so the mean row gives the average rating for each attribute.

Let's now look at the males.

In [None]:
# Separate the males and view a summary of their ratings for decision making
males = date_df.loc[date_df["gender"]==1,['gender','dec','attr','race',
                'shar','fun','amb','sinc','intel']]
print(males.describe())

From the summary of the male ratings we can see that the mean of "dec" is .48. This means that male participants decided they wanted to be matched in 48% of their dates. As before, attr, shar, fun, amb, sinc, and intel hold the ratings given to partners on a scale of 1-10, with 1 being awful and 10 being great.
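
The two per-gender summaries can also be produced side by side with a single `groupby`. This is a minimal sketch on invented values. It relies on the fact that the mean of a 0/1 column such as `dec` is the share of "yes" decisions.

```python
import pandas as pd

# Toy stand-in for date_df; 0 = female, 1 = male as in the notebook
toy = pd.DataFrame({
    "gender": [0, 0, 0, 0, 1, 1, 1, 1],
    "dec":    [1, 0, 0, 0, 1, 1, 0, 0],
    "attr":   [6, 5, 4, 7, 8, 7, 5, 6],
})

# One row per gender; the dec column becomes the "yes" rate for that gender
by_gender = toy.groupby("gender")[["dec", "attr"]].mean()
print(by_gender)
```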

In [None]:
# Visualize the matches based on each race
sns.countplot(x="match", hue="race", data = date_df)

With the data prepared and analysed, the next step is feature selection: we must select X and y.

### Feature Selection/Extraction 
Define our features.

X: independent variables: attr, sinc, intel, fun, amb, shar, like and match

y: dependent variable (target/output): dec, 0 or 1 (wants a match (1), does not want a match (0))

dec represents wanting to be matched, with 1 being a yes and 0 being a no.


In [None]:
# Create our features: the 6 attribute scores each participant gives their partner, plus like and match.
X = date_df.drop(labels=['dec',"gender", "race"], axis=1)

# Create our target: the dec (decision) the participant made about wanting a match.
y = date_df.dec

In [None]:
X.describe()
X.columns

In [None]:
# Check the balance of our target values
y.value_counts()

### Train-Test-Split
Now that the data is defined, we will split it into training data (80%) and testing data (20%), so we can train our model on one part and evaluate it on data it has not seen.
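
One detail worth knowing here: scikit-learn's `train_test_split` holds out 25% of the rows by default, so an 80/20 split has to be requested with `test_size=0.2`. A minimal sketch on synthetic data (the arrays below are invented) shows the resulting sizes and how `stratify` preserves the class balance:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70 of class 0 and 30 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# test_size=0.2 requests the 80/20 split; stratify keeps the 70/30 balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(Counter(y_tr.tolist()), Counter(y_te.tolist()))
```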

In [None]:
# Use train_test_split to split the data into 80% training and 20% testing data,
# stratified on y so both parts keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

print(Counter(y_train))
print(Counter(y_test))

### Classification Model 

We build models with several different algorithms to decide which one performs best; we then choose a model and tune its parameters to check for better performance. Here we used Logistic Regression, Random Forest, and Balanced Random Forest.

Train the Logistic Regression

#### Logistic Regression Algorithm
Logistic regression was chosen because the dependent variable has two outputs (binary). It is a predictive analysis used to describe data and explain the relationship between one dependent variable and one or more independent variables: will there be a match or not, based on the independent variables? It uses the sigmoid function, which always gives values between 0 and 1.
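
To make the role of the sigmoid concrete, here is a minimal sketch in plain Python. The weights and bias are hypothetical numbers chosen for the example, not coefficients fitted to the speed dating data.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a real-valued score into a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # this branch avoids overflow for large negative z
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Logistic regression: linear score first, then the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two ratings (for example attr and fun)
weights = [0.6, 0.2]
bias = -5.0

p = predict_proba([8, 6], weights, bias)  # linear score: -5 + 4.8 + 1.2 = 1.0
label = int(p >= 0.5)                     # 0.5 is the usual decision threshold
print(round(p, 3), label)
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.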

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="lbfgs", random_state=1)
# Train the model on the training data
model.fit(X_train, y_train)


In [None]:
# Calculate Balanced accuracy score
y_pred = model.predict(X_test)
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred) 

In [None]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

In [None]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced
report = classification_report_imbalanced(y_test, y_pred)
print(report)

#### Resample Data to see if there is any change in accuracy

We used a resampling tool to balance our dataset. We resampled the data with SMOTEENN, because many researchers suggest that combining oversampling and undersampling methods balances a dataset better, and re-ran the Logistic Regression to see whether the results differ. We then tried Random Forest and Balanced Random Forest.
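
Whatever resampler is used, it should see only the training rows, so that the test set keeps its real class balance. The sketch below illustrates that idea with plain random oversampling in NumPy. This is a deliberately simpler technique than SMOTEENN, and the arrays are synthetic.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Synthetic, imbalanced training set: 80 rows of class 0 and 20 of class 1
y_train_demo = np.array([0] * 80 + [1] * 20)
X_train_demo = rng.normal(size=(100, 3))

majority_idx = np.flatnonzero(y_train_demo == 0)
minority_idx = np.flatnonzero(y_train_demo == 1)

# Draw minority rows with replacement until both classes have the same count
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

keep = np.concatenate([majority_idx, minority_idx, extra_idx])
X_balanced, y_balanced = X_train_demo[keep], y_train_demo[keep]

balanced_counts = Counter(y_balanced.tolist())
print(balanced_counts)
```

The model would then be fitted on `X_balanced` and `y_balanced` and still evaluated on the untouched test split.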


In [None]:
# Resample the training data with SMOTEENN
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state = 1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
print(Counter(y_resampled))

In [None]:
# Train the Logistic Regression model using the resampled data
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="lbfgs", random_state=1)

# Train the model on the resampled training data
model.fit(X_resampled, y_resampled)

In [None]:
# Calculate Balanced accuracy score
y_pred = model.predict(X_test)
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred) 

In [None]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced
report = classification_report_imbalanced(y_test, y_pred)
print(report)

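Resampling the data showed no improvement or decline in the accuracy score. An identical score is also what results when the model is fitted on `X_train` rather than on `X_resampled`, so it is worth confirming which data the fit call received before reading much into this.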
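
The balanced accuracy score reported in this notebook is the mean of the per-class recall, which can be read directly off a confusion matrix. A minimal sketch with a made-up matrix (the counts below are not results from this notebook):

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout:
# rows are the true classes, columns are the predicted classes
cm_demo = np.array([[50, 10],
                    [5, 35]])

# Recall per class = correct predictions for the class / true members of the class
recall_per_class = np.diag(cm_demo) / cm_demo.sum(axis=1)

balanced_accuracy = recall_per_class.mean()
plain_accuracy = np.trace(cm_demo) / cm_demo.sum()

print(recall_per_class.round(3), round(balanced_accuracy, 3), plain_accuracy)
```

The two scores differ whenever the classes are of unequal size, which is why balanced accuracy is the more informative number on imbalanced targets.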

The random forest classifier is a supervised learning algorithm that can be used for both regression and classification problems. It is among the most popular machine learning algorithms due to its flexibility and ease of implementation. It often achieves high accuracy by training many decision trees on random subsets of the data and combining their predictions.
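
To see what combining decision trees means in scikit-learn specifically, the sketch below fits a small forest on synthetic data and checks that the forest's predicted probabilities equal the average of its individual trees' probabilities. The dataset and settings are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the speed dating data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# Class probabilities from each individual tree, stacked as (trees, rows, classes)
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)

# Compare the hand-computed average with the forest's own output
same = np.allclose(averaged, forest.predict_proba(X_demo))
print(same)
```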


In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

print("Accuracy: " + str(rf.score(X_test, y_test)))

Train with an Ensemble Learner
 Balanced Random Forest Classifier 

In [None]:
# Train a BalancedRandomForestClassifier with 100 estimators;
# it undersamples each bootstrap sample so every tree sees balanced classes
from imblearn.ensemble import BalancedRandomForestClassifier
model = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
# Train the model on the training data
model.fit(X_train,y_train)

In [None]:
# Calculate the balanced accuracy score
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test)
balanced_accuracy_score(y_test, y_pred)

In [None]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

In [None]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced
report = classification_report_imbalanced(y_test, y_pred)
print(report)

In [None]:
# Calculate feature importance in the Balanced Random Forest model
importances = model.feature_importances_
# List the features sorted in descending order by feature importance
sorted(zip(importances, X.columns), reverse=True)

The ensemble classifier did not improve our accuracy score; however, we were able to view the importance of our features. Attractiveness is the most important of the 6 characteristics the participants score on, and intelligence is the least important.
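
For reading feature importances, a labelled and sorted pandas Series is often easier to scan than a list of tuples. The sketch below assumes synthetic data with the notebook's column names attached purely as labels, so the resulting ranking says nothing about the real speed dating data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names borrowed from the notebook; the values are synthetic
feature_names = ["attr", "sinc", "intel", "fun", "amb", "shar"]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
ranking = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking.round(3))
```

The importances of a fitted scikit-learn forest sum to 1, so each value can be read as a share of the total.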

We received the best score with the Random Forest Classifier, at 81% accuracy, but in reality all the models perform about the same, at around 80% accuracy.
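
When scores are this close, it helps to compute them in one loop on the same split, so that differences come from the models and not from the data. A minimal sketch on synthetic data follows. The dataset, the `candidates` dictionary, and the settings are illustrative assumptions, and the numbers it prints are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (not the speed dating data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.6, 0.4], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Hypothetical set of candidate models, all evaluated on the same split
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

scores = {}
for name, clf in candidates.items():
    y_hat = clf.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        "balanced_accuracy": balanced_accuracy_score(y_te, y_hat),
    }

for name, result in scores.items():
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```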