## Individual work after completing Codecademy's *Build a Machine Learning Model with Python Skill Path* Course

In this project I explore further questions raised by the masculinity survey from <a href="https://fivethirtyeight.com/" target="_blank">FiveThirtyEight</a>.

The questions I will be exploring are:

* Who is more likely to pay on a date? (sexuality, demographic)
* Are certain beliefs or actions linked to more self-described masculine or feminine individuals?
* How do insecurities change as people grow older?


So far, clustering the subquestions of question 7 has not produced a "more masculine" and a "less masculine" cluster. Instead, one cluster contains the people who are more likely to do all of the activities, and the other contains the people who are less likely to do any of them. In other words, the two clusters are "people who do things" and "people who don't do things".

The code below shows the progress so far.

In [15]:
import pandas as pd

survey = pd.read_csv("masculinity.csv")

### Mapping text to numerical values

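This step is needed because KMeans can only be trained on numerical values. The answers are mapped to an ordinal scale in which a higher number means the activity is done more often.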

In [16]:
cols_to_map = ["q0007_0001", "q0007_0002", "q0007_0003", "q0007_0004",
               "q0007_0005", "q0007_0006", "q0007_0007", "q0007_0008",
               "q0007_0009", "q0007_0010", "q0007_0011"]

# Ordinal scale: a higher number means the activity is done more often.
frequency_scale = {"Never, and not open to it": 0, "Never, but open to it": 1,
                   "Rarely": 2, "Sometimes": 3, "Often": 4}

# Answers that are not keys of frequency_scale become NaN.
for column in cols_to_map:
    survey[column] = survey[column].map(frequency_scale)

for column in cols_to_map:
    print(survey[column].value_counts())

3.0    537
2.0    324
4.0    142
1.0    123
0.0     53
Name: q0007_0001, dtype: int64
3.0    514
2.0    387
4.0    123
1.0    101
0.0     50
Name: q0007_0002, dtype: int64
3.0    364
2.0    339
0.0    224
4.0    166
1.0     85
Name: q0007_0003, dtype: int64
2.0    505
3.0    371
1.0    121
0.0     78
4.0     43
Name: q0007_0004, dtype: int64
0.0    710
1.0    228
2.0    217
3.0     22
4.0      7
Name: q0007_0005, dtype: int64
4.0    427
3.0    384
2.0    155
0.0    102
1.0     91
Name: q0007_0006, dtype: int64
0.0    1001
1.0      60
3.0      41
4.0      39
2.0      31
Name: q0007_0007, dtype: int64
4.0    482
3.0    344
2.0    216
0.0     93
1.0     43
Name: q0007_0008, dtype: int64
3.0    353
2.0    316
4.0    296
0.0     95
1.0     95
Name: q0007_0009, dtype: int64
1.0    464
0.0    355
2.0    189
3.0     97
4.0     58
Name: q0007_0010, dtype: int64
2.0    456
3.0    339
0.0    187
4.0    111
1.0     75
Name: q0007_0011, dtype: int64
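
The counts above are shown as floats (`3.0`, `2.0`, ...) because `Series.map` with a dictionary returns `NaN` for every value that is not a key of the dictionary, and a column containing `NaN` is stored as float. A minimal sketch on made-up answers illustrates this; the `answers` series and the `"No answer"` label are assumptions for illustration, not values taken from the survey file:

```python
import pandas as pd

# Hypothetical answers; the last two are not keys in the mapping.
answers = pd.Series(["Often", "Rarely", "Sometimes", None, "No answer"])

frequency_scale = {
    "Never, and not open to it": 0,
    "Never, but open to it": 1,
    "Rarely": 2,
    "Sometimes": 3,
    "Often": 4,
}

mapped = answers.map(frequency_scale)
print(mapped.tolist())      # unmapped entries come back as NaN
print(mapped.isna().sum())  # rows that would have to be dropped before clustering
```

Counting the `NaN` values per column this way shows how many respondents will be lost when the rows are later passed through `dropna`.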


### Training the KMeans model

As the output below shows, the first cluster center describes the "not doers" and the second describes the "doers": the second center is higher on every activity. The clusters therefore separate people by how much they do overall, which means that there isn't a clear difference in the *masculinity* of the answers.

In [17]:
from sklearn.cluster import KMeans

# The seven subquestions of question 7 used for clustering.
cluster_cols = ["q0007_0001", "q0007_0002", "q0007_0003", "q0007_0004",
                "q0007_0005", "q0007_0008", "q0007_0009"]

# KMeans cannot handle NaN, so drop rows with a missing answer in any of these columns.
rows_to_cluster = survey.dropna(subset=cluster_cols)

classifier = KMeans(n_clusters=2)
classifier.fit(rows_to_cluster[cluster_cols])
# Cluster numbering is arbitrary without a fixed random_state, so the labels stay neutral.
print("Cluster center 0: ", classifier.cluster_centers_[0])
print("Cluster center 1: ", classifier.cluster_centers_[1])

Cluster center 0:  [1.87830688 1.84391534 0.85185185 1.72486772 0.57142857 2.64021164
 1.97089947]
Cluster center 1:  [2.84548105 2.81632653 2.84110787 2.39941691 0.69387755 3.06997085
 2.90087464]
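
KMeans assigns cluster indices arbitrarily, so which center is printed first can change between runs when `random_state` is not fixed. One possible way to make the comparison reproducible is to sort the centers by their mean and then check whether one center is higher in every column. The sketch below uses synthetic data rather than the survey, and the names `low_group`, `high_group` and `model` are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the survey answers: two groups on a 0-4 scale.
rng = np.random.default_rng(0)
low_group = rng.integers(0, 3, size=(50, 4))    # answers in 0..2
high_group = rng.integers(2, 5, size=(50, 4))   # answers in 2..4
answers = np.vstack([low_group, high_group]).astype(float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(answers)

# Sort the centers by their mean so index 0 is always the "lower" cluster.
order = np.argsort(model.cluster_centers_.mean(axis=1))
low_center, high_center = model.cluster_centers_[order]

print("Lower center: ", low_center)
print("Higher center:", high_center)
# True when one cluster scores higher on every single column.
print("Higher on every column:", bool(np.all(high_center > low_center)))
```

If the last line prints `True`, the clusters differ in overall level of activity and not in which activities are done, which is the "doers" versus "not doers" pattern described above.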


## Who is more likely to pay on a date?


Before answering this question, we need to know which clusters of individuals emerge from self-described masculinity and education level.

Mapping the educ4 and q0001 (education and question one) columns to numbers:



In [18]:
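# Note: this scale jumps from 2 to 4 (there is no 3), so "Post graduate degree" sits
# twice as far from "College or more" as the other neighbouring levels do from each other.
survey["educ4"] = survey["educ4"].map({"High school or less": 0, "Some college": 1,
                                       "College or more": 2, "Post graduate degree": 4})

survey["q0001"] = survey["q0001"].map({"Very masculine": 3, "Somewhat masculine": 2,
                                       "Not very masculine": 1, "Not at all masculine": 0})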
print(survey["educ4"].value_counts())
print(survey["q0001"].value_counts())

4.0    389
2.0    373
1.0    307
0.0    119
Name: educ4, dtype: int64
2.0    617
3.0    443
1.0     97
0.0     21
Name: q0001, dtype: int64


In [24]:
rows_to_cluster_2 = survey.dropna(subset=["educ4", "q0001"])

classifier_q0001_educ4 = KMeans(n_clusters = 2, random_state = 20)

classifier_q0001_educ4.fit(rows_to_cluster_2[["educ4", "q0001"]])

print(classifier_q0001_educ4.cluster_centers_)
print(rows_to_cluster_2["q0001"].mean())

[[1.32070707 2.26136364]
 [4.         2.25194805]]
2.258283772302464
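
The two centers above have almost the same q0001 value (about 2.26 and 2.25), both close to the overall mean of 2.258, while their educ4 values differ (about 1.32 and 4). A more direct way to check whether self-described masculinity varies with education is to compare group means. The following is only a sketch on a small made-up table that reuses the numeric coding from the cell above; the `toy` DataFrame is hypothetical and does not contain survey data:

```python
import pandas as pd

# Hypothetical rows using the same numeric coding as above (not survey data).
toy = pd.DataFrame({
    "educ4": [0, 0, 1, 1, 2, 2, 4, 4],
    "q0001": [2, 3, 2, 3, 1, 3, 2, 3],
})

# Mean and count of self-described masculinity per education level.
summary = toy.groupby("educ4")["q0001"].agg(["mean", "count"])
print(summary)

# Spread of the group means: a value near 0 means education level
# carries little information about q0001 in this toy data.
spread = summary["mean"].max() - summary["mean"].min()
print("Spread of group means:", spread)
```

Applied to the real data, the same `groupby` could be run on `rows_to_cluster_2` to see whether any education level stands out.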
