# Assignment Overview

Links to the notes discussed in the video:
* [Model Selection Overview](./ModelSelect.pdf)
* [Model Types](./ModelType.pdf)
* [Model Decision Factors](./ModelDecisionFactors.pdf)
* [Generalization Techniques](./Generalization.pdf)

The assignment consists of two parts requiring you to select appropriate models with associated code/text.

1. Determine challenge and relevant model for two distinct situations (fill out this notebook). 
1. Address the data code needed and the model for [car factors](./CarFactors/carfactors.ipynb) contained in the subdirectory, CarFactors.

* ***Check the rubric in Canvas*** to make sure you understand the requirements and the associated grading weights.

# Part 1: Speed Dating Model Selection

You are to explore the data set on speed dating and construct two models that provide some insight, such as groupings or predictions.  The models must come from different model areas, such as the categories listed in the [ModelTypes](./ModelTypes.pdf) document.  You must justify your answers, considering the data and the value of the predictions.

The data is contained in [SpeedDatingData.csv](SpeedDatingData.csv).  The values are detailed in [SpeedDatingKey.md](./SpeedDatingKey.md).  The directory also contains the original key document, SpeedDatingDataKey.docx, but JupyterLab is unable to render it.  You are free to open it outside of JupyterLab if something didn't translate clearly.  The open source tool [pandoc](https://pandoc.org/installing.html) was used to perform the translation.  It is useful for almost any document conversion and works on all major operating systems.

# Model 1

## Outline the challenge 

Based on the dataset, the most obvious challenge is predicting match outcomes.

### Select the features and their justification 

There are many features I think could help a machine learning model determine whether a match will be successful. \
Age (age): In many relationships, age plays a major role in success \
Race (race): Similar backgrounds may play into compatibility \
Importance of Race (imprace): Lets the model know whether it should treat race as a big factor for each person \
Field of Study (field_cd): Similar academic interests may help compatibility \
Interest Correlation (int_corr): Can indicate whether the two people enjoy the same hobbies \
Preferences (pf_o_att): Gives an overview of what a person prioritizes in a partner \
Partner Assessment (attr_o): Shows how the partner rated the participant, which matters a lot in a match \
Decision (dec): Indicates the likelihood of wanting to see someone again, i.e. whether they enjoyed the night enough to go out a second time \
Like (like): Shows how much a participant liked their partner \
Match (match): Target variable

### Note necessary feature processing such as getting rid of empty cells etc.

It may be useful to normalize certain features, such as the preference scores, in case people didn't fully follow the correct input range. Creating a few new features (columns) may also help speed up or improve the algorithm; these could include a preference match (the ratio between a participant's preference and the partner's attributes) and the age difference. For certain variables, such as attr_o, it may be useful to fill in any gaps with a mean or median value.
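
The processing steps above could be sketched roughly as follows. This is a minimal illustration on a made-up four-column table, not the actual preprocessing for the assignment: the helper name `add_engineered_features`, the derived column names `age_diff` and `pf_o_att_norm`, and the partner-age column `age_o` are assumptions, and min-max scaling is just one possible way to normalize.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an imputed rating and two derived columns."""
    out = df.copy()

    # Fill gaps in the partner's attractiveness rating with the column median.
    out["attr_o"] = out["attr_o"].fillna(out["attr_o"].median())

    # Absolute age gap; `age_o` (partner's age) is an assumed column name.
    out["age_diff"] = (out["age"] - out["age_o"]).abs()

    # Min-max scale the stated preference to [0, 1] so that participants
    # who used different input ranges become comparable.
    pref = out["pf_o_att"]
    span = pref.max() - pref.min()
    out["pf_o_att_norm"] = (pref - pref.min()) / span if span > 0 else 0.0

    return out


# Tiny made-up table, only to show the transformation.
toy = pd.DataFrame({
    "age": [24, 27, 30],
    "age_o": [26, 27, 22],
    "attr_o": [6.0, np.nan, 8.0],
    "pf_o_att": [20.0, 35.0, 50.0],
})
features = add_engineered_features(toy)
print(features)
```

Median imputation is used here rather than the mean because it is less sensitive to a few extreme ratings; either would fit the description above.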

### Model Selection

Outline the rationale for selecting the model noting how its capabilities address your challenge

I think the best model for this problem would be a Random Forest. Because it is an ensemble of decision trees, it is capable of handling a mix of numerical and categorical data, both of which are very prevalent in the CSV.
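
One practical caveat: scikit-learn's `RandomForestClassifier` expects numeric input, so coded categorical columns such as `field_cd` and `race` are usually one-hot encoded first. The sketch below shows one way to do that with a pipeline. The eight rows are made up for illustration, and the split into `categorical` and `numeric` columns is an assumption about how the features would be treated.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using column names from the feature list above.
toy = pd.DataFrame({
    "age": [22, 25, 31, 28, 24, 35, 27, 29],
    "int_corr": [0.1, 0.6, -0.2, 0.4, 0.7, -0.5, 0.3, 0.0],
    "field_cd": [1, 8, 1, 10, 8, 3, 10, 3],
    "race": [2, 4, 2, 3, 4, 2, 3, 4],
    "match": [0, 1, 0, 1, 1, 0, 1, 0],
})

categorical = ["field_cd", "race"]  # integer codes without a natural order
numeric = ["age", "int_corr"]       # passed through unchanged

# One-hot encode the categorical codes; leave numeric columns as they are.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])

model.fit(toy[categorical + numeric], toy["match"])
predictions = model.predict(toy[categorical + numeric])
print(predictions)
```

Keeping the encoder inside the pipeline means the same encoding is applied at training and prediction time, and `handle_unknown="ignore"` avoids errors when a category code appears only in the test data.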

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("SpeedDatingData.csv")  # Assumes the file has already been preprocessed as described above

X = df.drop('match', axis=1) 
y = df['match']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

random_forest_model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=42) 

random_forest_model.fit(X_train, y_train)
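
A fitted forest still needs to be checked against the held-out split. The following is a minimal, self-contained sketch of that step on synthetic data generated with `make_classification` (the real speed dating features are not used here, and the 80/20 class weights are an arbitrary assumption). It reports accuracy and F1, since accuracy alone can look good when one class dominates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed features and target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Score on data the model has not seen during training.
y_pred = forest.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Impurity-based importances; they sum to 1 across all features.
importances = forest.feature_importances_

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
print("feature importances:", importances.round(3))
```

The feature importances give a rough view of which inputs the trees relied on, which can help justify or revisit the feature selection.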

# Model 2

## Outline the challenge

Another challenge that can be addressed is analyzing participants' preferences and recommending potential matches. While similar to the previous challenge, this does not directly predict whether someone will be a match. Instead, the algorithm looks at the data and returns a list of potential partners who may end up being a match.

### Select the features and their justification

Preference Scores (attr1_1): Gives an overview of the partner's preferences across all six attributes \
shar1_1: Shows whether the individuals have shared interests, which is important in deciding whether they would be good potential partners \
int_corr: How many interests the participants share between them \
imprace: Whether race matters to the participant, which makes a major difference in the outcome \
imprelig: Importance of religion; much like race, it makes a major difference in the outcome \
age: The participant's age, which can affect what age they may be looking to date \
race and samerace: Can make a difference, especially if imprace is high \
goal: Why someone attends a speed dating event can affect their preferred partner \
sports, tvsports, museums, and gaming: All of these can affect how compatible people are; you wouldn't want to match someone who enjoys gaming with someone who can't stand gaming and would prefer to play sports (not exclusive, but it tends to be a factor in a relationship)

### Model Selection

Outline the rationale for selecting the model noting how its capabilities address your challenge

Gradient Boosting is a good choice for this challenge. It is adept at capturing non-linear relationships and interactions between features, which matters here because few of our features relate linearly to compatibility. Because each new tree is fit to correct the errors of the previous trees, the model may find links between features that we would not have noticed and use them to improve its output.

In [1]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Assumes the file has been preprocessed as described above and that a binary
# `high_compatibility` label column was derived during that preprocessing.
compatibility_df = pd.read_csv("SpeedDatingData.csv")

X = compatibility_df.drop('high_compatibility', axis=1)
y = compatibility_df['high_compatibility']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

gradient_boosting_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

gradient_boosting_model.fit(X_train, y_train)
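
To turn a fitted classifier into the recommendation list described in the challenge, one option is to score each candidate pairing with `predict_proba` and keep the highest-scoring candidates. The sketch below illustrates this on synthetic pair features. The `recommend` helper, the three feature columns, the labelling rule, and the candidate ids are all hypothetical and only serve to show the ranking step.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)

# Synthetic pair features, standing in for e.g. interest correlation,
# age difference, and a shared-interest score (assumed columns).
X_pairs = rng.normal(size=(200, 3))
# Hypothetical labelling rule, used only to give the model something to learn.
noise = 0.1 * rng.normal(size=200)
y_pairs = (X_pairs[:, 0] - 0.5 * np.abs(X_pairs[:, 1]) + noise > 0).astype(int)

model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
model.fit(X_pairs, y_pairs)


def recommend(model, candidate_features, candidate_ids, top_k=3):
    """Rank candidates by predicted probability of the positive class."""
    scores = model.predict_proba(candidate_features)[:, 1]
    order = np.argsort(scores)[::-1][:top_k]  # highest score first
    return [(candidate_ids[i], float(scores[i])) for i in order]


# Five made-up candidate pairings for a single participant.
candidates = rng.normal(size=(5, 3))
ids = ["p1", "p2", "p3", "p4", "p5"]
top = recommend(model, candidates, ids, top_k=3)
print(top)
```

Ranking by probability rather than by the hard 0/1 prediction lets the list stay useful even when few or no candidates cross the default decision threshold.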