# Contraceptive Method Choice
Lab Assignment Four: Extending Logistic Regression

**_Jake Oien, Seung Ki Lee, Jenn Le_**

## Data Preparation and Overview

### Business Case

### Class Variables

Dataset Source: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice

This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know whether they were pregnant at the time of the interview. The problem is to predict a woman's current contraceptive method choice (no use, long-term methods, or short-term methods) from her demographic and socio-economic characteristics.

1. Wife's age (numerical) 
2. Wife's education (categorical) 1=low, 2, 3, 4=high 
3. Husband's education (categorical) 1=low, 2, 3, 4=high 
4. Number of children ever born (numerical) 
5. Wife's religion (binary) 0=Non-Islam, 1=Islam 
6. Wife's now working? (binary) 0=Yes, 1=No 
7. Husband's occupation (categorical) 1, 2, 3, 4 
8. Standard-of-living index (categorical) 1=low, 2, 3, 4=high 
9. Media exposure (binary) 0=Good, 1=Not good 
10. Contraceptive method used (class attribute) 1=No-use, 2=Long-term, 3=Short-term

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Here, we'll import the data and add column names
col_names = ["wife_age", 
             "wife_education", 
             "husband_education", 
             "num_children", 
             "wife_practices_islam", 
             "wife_working", 
             "husband_occupation",
             "sol_index",  # standard of living index
             "media_exposure",
             "contraceptive_method"
            ]
data = pd.read_csv("./cmc.csv", header=None, names=col_names, encoding='latin-1')

In [None]:
data.info()
data.head(10)

First, the data is complete; there are no null values. 

Second, the `husband_occupation` field does not have a good description for the data. The only information about it is that it is a categorical field that can contain the values 1,2,3,4. Without more information, we will remove `husband_occupation` from the dataset. 

Third, the `wife_working` field uses non-intuitive inverted logic. In the dataset as given, a 0 means working and a 1 means not working. We will flip those values so that a 1 means working and a 0 means not working. Similarly, the `media_exposure` column uses a 0 to mean "Good" and a 1 to mean "Not good". We will flip those values as well, so that the lower number refers to the worse category. 

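Finally, because our dataset is not large and the class labels are short strings, we could replace the numeric codes in the `contraceptive_method` column with their text labels. For now that mapping is left commented out in the cell below, because the correlation heatmap later in this section needs numeric values. 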
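
As a minimal sketch of that label mapping, the snippet below keeps the numeric codes and adds the text labels in a separate column. The `demo` frame and the `method_label` column name are made up for illustration; the code-to-label pairs follow the attribute list above (2=Long-term, 3=Short-term).

```python
import pandas as pd

# Codes as documented in the attribute list: 1=No-use, 2=Long-term, 3=Short-term
method_labels = {1: "No-use", 2: "Long-term", 3: "Short-term"}

# Tiny illustrative frame (hypothetical values, not rows from cmc.csv)
demo = pd.DataFrame({"contraceptive_method": [1, 2, 3, 1]})

# Add labels alongside the codes so numeric analyses such as .corr() still work
demo["method_label"] = demo["contraceptive_method"].map(method_labels)
print(demo)
```

Keeping both columns avoids having to choose between readable plot legends and numeric analysis.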

In [None]:
# remove husband_occupation
if "husband_occupation" in data:
    del data["husband_occupation"]
    
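# flip 1s and 0s (1-1 = 0, 1-0 = 1) so that 1 means "working" / "good exposure"
# each column is flipped exactly once; flipping twice would restore the original coding
data["wife_working"] = 1 - data["wife_working"]
data["media_exposure"] = 1 - data["media_exposure"]

# Optional text labels; codes follow the attribute list above (2=Long-term, 3=Short-term).
# Left commented out because the correlation heatmap below needs numeric values.
# data["contraceptive_method"] = data["contraceptive_method"].map({1: "No use", 2: "Long-term", 3: "Short-term"})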

In [None]:
data.head()

There are a few things we can look at regarding this data. To start, let's take a look at contraceptive use vs the number of children the wife has. 

In [None]:
%matplotlib inline
import seaborn as sns
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

# count the women in each (num_children, contraceptive_method) combination
num_children_by_contraceptive_method = pd.crosstab(data["num_children"], data["contraceptive_method"])

print(num_children_by_contraceptive_method)

num_children_by_contraceptive_method.plot(kind="bar", stacked=True)

# Alternative view (unused): overlaid distributions of num_children per method
# fig, ax = plt.subplots()

# sns.distplot(data[data.contraceptive_method == 1].num_children, label="No use", ax=ax)
# sns.distplot(data[data.contraceptive_method == 2].num_children, label="Long-term use", ax=ax)
# sns.distplot(data[data.contraceptive_method == 3].num_children, label="Short-term use", ax=ax)

# plt.title("")
# plt.legend();

In [None]:
num_children_percentage = num_children_by_contraceptive_method.div(num_children_by_contraceptive_method.sum(axis=1).astype(float),
                             axis=0) # normalize the value

num_children_percentage.plot(kind="barh", stacked=True)

We can see that for wives with 0 children, contraceptive use is almost all "No use". Among families with fewer children (1-3), the share of long-term contraceptive use starts to rise as the number of children increases. Similarly, once the number of children passes 3, short-term contraceptive use seems to make up a growing share of the sampled population. 

One might think that contraceptive use would correspond with having fewer children, because women on birth control have fewer kids. However, contraceptive usage might actually be a response to family size. At a certain point, a family might decide that it is large enough and start using contraceptives to keep it from growing further. 
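
The row normalization used for the percentage chart above can also be done by `pd.crosstab` itself through its `normalize="index"` argument. The sketch below uses a tiny made-up frame (`demo`) rather than the survey data, so the numbers carry no meaning beyond showing the mechanics.

```python
import pandas as pd

# Hypothetical toy data, not rows from cmc.csv
demo = pd.DataFrame({
    "num_children":         [0, 0, 1, 1, 1, 2],
    "contraceptive_method": [1, 1, 1, 2, 3, 3],
})

# normalize="index" divides each row by its row total, so every row sums to 1
pct = pd.crosstab(demo["num_children"], demo["contraceptive_method"],
                  normalize="index")
print(pct)
```

This should be equivalent to dividing the count table by its row sums, as the cell above does by hand.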

Another point to look at is how contraceptive use relates to education level and standard of living. According to the paper found here https://www.jstor.org/stable/2138087, women's contraceptive usage tends to increase as their level of education increases. Additionally, a lower standard of living might mean that a woman has less access to contraceptives and hence would tend towards "No use". 

In [None]:
sol = pd.crosstab([data.sol_index], data.contraceptive_method)
print (sol)
# sol_and_education.plot(kind="bar", stacked=True)

sol_percentage = sol.div(sol.sum(axis=1).astype(float),
                             axis=0) # normalize the value

sol_percentage.plot(kind="barh", stacked=True)
plt.title("Percentage of contraceptive usages by standard of living");
plt.xlabel("Percentage of class");

One thing to note from the table here is the increase in number of women surveyed as the standard of living increases. We can see clearly from this graph that as the standard of living goes up, use of contraceptives goes up. Long-term contraceptive usage appears to stay relatively constant. 

Next, let's look at education and contraceptive use. 

In [None]:
education = pd.crosstab([data.wife_education], data.contraceptive_method)
print (education)
# sol_and_education.plot(kind="bar", stacked=True)

education_percentage = education.div(education.sum(axis=1).astype(float),
                             axis=0) # normalize the value

education_percentage.plot(kind="barh", stacked=True)
plt.title("Percentage of contraceptive usages by education")
plt.xlabel("Percentage of class")

As before, we see that as the woman's education level goes up, the share of "No use" goes down. This is in line with much research on the impact of education on contraceptive use. 

Finally, let's see how all of the other variables relate to one another. 

In [None]:
# plot the correlation matrix using seaborn
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
sns.set(style="darkgrid") # one of the many styles to plot using

f, ax = plt.subplots(figsize=(9, 9))

sns.heatmap(data.corr(), cmap=cmap, annot=True, center=0)

f.tight_layout()

From this heatmap we see a few strong correlations. The strongest correlation is that between the husband's education and the wife's education. It would make sense that a couple would tend to have similar education levels, or at least that as one member of the couple trends towards higher education that the other would trend toward higher education. 

Additionally, we see that the next highest correlation is that between wife_age and num_children. Younger women simply haven't had as much time as other women to have children. 

For contraceptive method, the correlations above 0.1 include media exposure and husband/wife education. These are overall pretty weak correlations, but that doesn't mean that the class cannot be predicted. A correlation coefficient only measures the linear relationship between two variables, and the underlying relationship might not be linear. That's what we will find out with logistic regression. 
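
To illustrate why a weak correlation does not rule out a strong dependence, here is a small sketch on synthetic values (not the survey data): `y` is completely determined by `x`, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# Synthetic example: y depends on x exactly, but not linearly
x = np.linspace(-1, 1, 201)
y = x ** 2

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print("Pearson r between x and x**2: {:.6f}".format(r))
```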

### Data Division

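We will now split the data into a training set and a testing set. We will use an 80/20 split, a common default that leaves most of the data for training while still holding out enough instances to estimate performance on unseen data. 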

In [None]:
X = data.drop("contraceptive_method", axis=1).to_numpy()  # as_matrix() was removed from pandas
y = data["contraceptive_method"]
print(X.shape)
print(y.shape)

In [None]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print("Training set has {} instances".format(len(X_train)))
print("Testing set has {} instances".format(len(X_test)))
print("The whole set has {} instances".format(len(X_train) + len(X_test)))
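
One possible refinement, not used in the cell above, is to stratify the split so that each class keeps the same proportion in both sets, and to fix `random_state` so the split is reproducible. The sketch below assumes synthetic arrays (`X_demo`, `y_demo`) with made-up class counts; only the `train_test_split` arguments matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 instances, three imbalanced classes (50/30/20)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([1, 2, 3], [50, 30, 20])

# stratify=y_demo keeps class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0)

print("train class counts:", np.bincount(y_tr)[1:])
print("test class counts: ", np.bincount(y_te)[1:])
```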

## Modeling

### Custom Logistic Regression Classifier
_Make your implementation of logistic regression compatible with the GridSearchCV function that is part of scikit-learn._
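
As a rough sketch of what that compatibility involves (this is not the implementation for this lab), the class below follows scikit-learn's estimator conventions: hyperparameters are stored unchanged in `__init__`, `fit` returns `self`, and `ClassifierMixin` supplies the accuracy `score` that `GridSearchCV` uses by default. The class name `SketchLogisticRegression`, its parameters `eta`, `iterations`, and `l2`, the one-vs-rest gradient ascent, and the synthetic clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV


class SketchLogisticRegression(ClassifierMixin, BaseEstimator):
    """Hypothetical one-vs-rest logistic regression (batch gradient ascent)."""

    def __init__(self, eta=0.1, iterations=200, l2=0.01):
        # Only store the arguments: GridSearchCV clones the estimator via
        # get_params/set_params, which BaseEstimator derives from this signature.
        self.eta = eta
        self.iterations = iterations
        self.l2 = l2

    @staticmethod
    def _add_bias(X):
        X = np.asarray(X, dtype=float)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xb = self._add_bias(X)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.w_ = np.zeros((len(self.classes_), Xb.shape[1]))
        for k, label in enumerate(self.classes_):
            target = (y == label).astype(float)  # this class vs. the rest
            w = np.zeros(Xb.shape[1])
            for _ in range(self.iterations):
                p = 1.0 / (1.0 + np.exp(-(Xb @ w)))       # sigmoid
                grad = Xb.T @ (target - p) / len(target)  # log-likelihood gradient
                grad[1:] -= 2 * self.l2 * w[1:]           # L2 penalty, bias excluded
                w += self.eta * grad
            self.w_[k] = w
        return self

    def predict(self, X):
        scores = self._add_bias(X) @ self.w_.T
        return self.classes_[np.argmax(scores, axis=1)]


# Synthetic three-class data (made up; not the survey data)
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
y_demo = np.repeat([1, 2, 3], 50)
X_demo = centers[y_demo - 1] + rng.normal(size=(150, 2))

search = GridSearchCV(SketchLogisticRegression(), {"l2": [0.0, 0.01, 1.0]}, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern should carry over to the real `X_train` and `y_train`, although unscaled features such as `wife_age` would likely need standardizing before plain gradient ascent behaves well.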

### Training

### Comparison