# Machine Learning Projects

## Machine learning checklist
### 1. Frame the problem and look at the big picture
- Today I'm going to look at predicting gender from a first name
- Something like this might be valuable to organizations that want to understand which gender donates more, volunteers more, and so on
- This might provide insight into which gender to target for certain campaigns

### 2. Get the data and explore it
- I took the data from https://www.ssa.gov/oact/babynames/names.zip
- It contains one file per birth year from 1880 to 2016, listing each name with its gender and frequency
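
A minimal sketch of loading one file in this layout, assuming each line is `name,gender,count` with no header row. The three rows are copied from the `data.head()` output in this notebook and stand in for the real `yob2016.txt`; the column names are this notebook's own choice.

```python
import io

import pandas as pd

# Stand-in for yob2016.txt: a few rows in the name,gender,count layout.
sample = io.StringIO(
    "Olivia,F,19246\n"
    "Ava,F,16237\n"
    "Sophia,F,16070\n"
)

# header=None stops pandas from treating the first record as column names;
# names= supplies the column labels directly.
df = pd.read_csv(sample, header=None, names=["name", "gender", "freq"])

print(df.shape)
print(df.iloc[0]["name"])
```

Passing `header=None` keeps every record as data, which matters for a file that starts directly with its first name.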

### 3. Prepare the data
- Deal with incomplete data
- Clean the features so your machine learning algorithms can process them
- Feature selection
- Feature scaling (normalizing or standardizing features, etc.)
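
As a rough illustration of the first and last steps, here is a sketch on a hypothetical toy table (the `age` and `income` columns are made up and are not part of the baby-name data). It fills gaps with the column mean and then standardizes each column; other imputation and scaling strategies are just as valid.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with missing values (not the baby-name data)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30000.0, 52000.0, np.nan, 41000.0],
})

# Deal with incomplete data: fill each gap with that column's mean
filled = toy.fillna(toy.mean())

# Feature scaling: standardize each column to zero mean and unit variance
standardized = (filled - filled.mean()) / filled.std(ddof=0)

print(standardized.round(2))
```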

### 4. Short list some promising models
- If the data is huge, training time might be an issue
- You might need to reduce the amount of training data so you can compare many models
- Research whether something similar has been done before and what models were used
- Train your models and compare them using cross-validation
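
The idea of shortlisting on a smaller sample could look something like the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the name data), and the two candidate models and the subset size of 500 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in your own features and labels
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Work on a random subset so every candidate trains quickly
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=500, replace=False)
X_small, y_small = X[subset], y[subset]

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate with the same 5-fold cross-validation
mean_scores = {}
for label, model in candidates.items():
    scores = cross_val_score(model, X_small, y_small, cv=5)
    mean_scores[label] = scores.mean()
    print(f"{label}: {scores.mean():.3f}")
```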

### 5. Fine tune your system
- Use as much data as possible for this step
- Fine tune hyperparameters

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score

In [4]:
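# Load the dataset
# Note: the file appears to have no header row, so read_csv uses its first
# record as column names; pass header=None to keep that record as data
data = pd.read_csv('yob2016.txt')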

In [6]:
data.columns = ['name', 'gender', 'freq']

In [7]:
data.head()

|   | name | gender | freq |
|---|------|--------|------|
| 0 | Olivia | F | 19246 |
| 1 | Ava | F | 16237 |
| 2 | Sophia | F | 16070 |
| 3 | Isabella | F | 14722 |
| 4 | Mia | F | 14366 |


## Input Output Features
- We need to decide on input features
- The output we're trying to predict is the gender

## A simple approach
- Let's count how often each letter appears in the name
- e.g. abbac -> [2, 2, 1, 0, ..., 0] (two a's, two b's, one c)
- This gives us an input vector with 26 features, one per letter (most entries will be zero)

In [8]:
def name_to_freq_vector(name):
    lower_name = name.lower()
    arr = np.zeros(26)
    for character in lower_name:
        arr[ord(character) - ord('a')] += 1
    return arr

In [10]:
# Let's process the names
name_vectors = []
for name in data.name:
    name_vectors.append(name_to_freq_vector(name))

# Take a look at the name_vectors array
name_vectors = np.array(name_vectors)
print(data.name[0])
print(name_vectors[0])

Olivia
[ 1.  0.  0.  0.  0.  0.  0.  0.  2.  0.  0.  1.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  1.  0.  0.  0.  0.]


In [14]:
# Let's process the gender
# scikit-learn ships with a LabelEncoder
# It sorts the classes alphabetically, so F becomes 0 and M becomes 1
# The mapping is kept in le.classes_
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
gender = le.fit_transform(data.gender).reshape(-1)
gender = np.array(gender)
print(le.classes_)
print(gender.shape)

['F' 'M']
(32867,)


In [18]:
gender.shape

(32867,)

In [16]:
x_train, x_test, y_train, y_test = train_test_split(name_vectors, gender, test_size=0.33)

In [17]:
x_train.shape

(22020, 26)

## Machine Learning
- Now that we've prepared our data we can try out some machine learning algorithms
- This process is actually iterative
- As you'll see we will likely have to go back and examine our input features, potentially removing features or adding others

In [20]:
from sklearn.tree import DecisionTreeClassifier

# Fit the classifier
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [21]:
# Let's evaluate the accuracy
from sklearn.metrics import accuracy_score
y_pred = clf.predict(x_test)

In [22]:
print(accuracy_score(y_test, y_pred))

0.61934175348


# Results
- Our first attempt is OK
- About 62% accuracy is better than the 50% we'd expect from random guessing
- There is definitely a signal in the letter counts, but we can do better

# Cross Validation
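- I want to evaluate more models
- But we can't score them all against the test data and pick the best one
- Doing so tunes our choice to the test set, so the final accuracy estimate would be too optimistic (a form of overfitting)
- Instead, cross-validation repeatedly splits the training data into a part for fitting and a part for validation, so models can be compared without touching the test data
- The test data should be saved until the very end, once you've picked your model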
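
To make the mechanics concrete, here is a hand-rolled sketch of k-fold cross-validation, roughly what `cross_val_score` does for a classifier. The data is synthetic (an assumption) and three folds are an arbitrary choice; only the training portion of your data would go through this loop.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X, y):
    # Fresh, unfitted copy for every fold so no fold sees another's fit
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # Accuracy on the part of the training data held out for this fold
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))

print(np.round(fold_scores, 3))
```

Each sample is used for validation exactly once, and the held-out test set is never touched.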

# Algorithms
- Support vector machines
- Logistic regression
- Random forests (we haven't touched on these, but a random forest is essentially an ensemble of decision trees, each trained with extra randomness)

In [23]:
from sklearn.model_selection import cross_val_score

In [24]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [26]:
# Logistic regression
clf = LogisticRegression()
cross_val_score(clf, x_train, y_train, cv=2)

array([ 0.68694941,  0.68816423])

In [27]:
clf = RandomForestClassifier()
cross_val_score(clf, x_train, y_train, cv=2)

array([ 0.66597039,  0.65927877])

In [28]:
clf = SVC()
cross_val_score(clf, x_train, y_train, cv=2)

array([ 0.70447734,  0.70678536])

# Next Steps
## Input Features
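- It looks like support vector machines are performing the best (about 70% cross-validated accuracy, versus about 69% for logistic regression and 66% for random forests)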
- It's probably best to go back and re-examine your input features
- Maybe we take order into account
- Maybe you just look at the first letter
- Maybe you just look at the last letter
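
One way the first-letter and last-letter ideas might be encoded is sketched below. The helper names `one_hot_letter` and `first_last_vector` are hypothetical, and the sketch assumes names contain only the letters a-z, as the letter-count approach above does.

```python
import numpy as np

def one_hot_letter(letter):
    """Return a 26-element vector with a 1 at the letter's position."""
    vec = np.zeros(26)
    vec[ord(letter) - ord("a")] = 1
    return vec

def first_last_vector(name):
    """Concatenate one-hot encodings of the first and last letters."""
    lower_name = name.lower()
    return np.concatenate(
        [one_hot_letter(lower_name[0]), one_hot_letter(lower_name[-1])]
    )

vec = first_last_vector("Olivia")
print(vec.shape)           # (52,)
print(vec[:26].argmax())   # 14, the index of 'o'
print(vec[26:].argmax())   # 0, the index of 'a'
```

Unlike raw letter counts, this representation keeps a little positional information, so "Olivia" and a name with the same letters in a different order no longer look identical.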

## Fine Tune your model
- Once you've decided on a model it's time to fine tune it
- Experiment with different kernels
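
Trying several kernels by hand gets tedious, so a grid search is one option. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data (an assumption); the kernels and `C` values in the grid are arbitrary examples, not recommendations for the name data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination of kernel and C is scored with 3-fold cross-validation
param_grid = {
    "kernel": ["linear", "rbf", "sigmoid"],
    "C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search only uses cross-validation on the training data, the test set stays untouched until the final evaluation.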

## Testing
- Only once you've decided on your model and its parameters should you evaluate it on the test data
- This will give you an unbiased estimate of the model's accuracy on unseen data

In [29]:
# Try a sigmoid kernel in place of the default RBF kernel
clf = SVC(kernel='sigmoid')
cross_val_score(clf, x_train, y_train, cv=2)

array([ 0.63727182,  0.63084749])

In [30]:
clf.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [31]:
y_pred = clf.predict(x_test)
accuracy_score(y_test, y_pred)

0.61298054761685261