# Midterm Project
* Data source: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice
* Data description:
           1. Wife's age                     (numerical)
           2. Wife's education               (categorical)      1=low, 2, 3, 4=high
           3. Husband's education            (categorical)      1=low, 2, 3, 4=high
           4. Number of children ever born   (numerical)
           5. Wife's religion                (binary)           0=Non-Islam, 1=Islam
           6. Wife's now working?            (binary)           0=Yes, 1=No
           7. Husband's occupation           (categorical)      1, 2, 3, 4
           8. Standard-of-living index       (categorical)      1=low, 2, 3, 4=high
           9. Media exposure                 (binary)           0=Good, 1=Not good
           10. Contraceptive method used     (class attribute)  1=No-use
                                                                2=Long-term
                                                                3=Short-term
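## Topic 2: Predict a married woman's choice of contraceptive method (no use, long-term, or short-term) from her age, education, religion, and employment, her husband's education and occupation, the number of children, the standard-of-living index, and media exposure.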

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl
import seaborn as sns
mpl.rc('font', family='Noto Sans CJK TC')
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [2]:
columnNames = ['W_age', 'W_education', 'H_education', 'ChildrenNum', 'W_religion', 'W_working', 'H_occupation', 'Living_index', 'Media_exposure', 'Contraceptive_method']

In [3]:
df = pd.read_csv("cmcdata.csv",names=columnNames)

In [4]:
df.head()

Unnamed: 0,W_age,W_education,H_education,ChildrenNum,W_religion,W_working,H_occupation,Living_index,Media_exposure,Contraceptive_method
0,24,2,3,3,1,1,2,3,0,1
1,45,1,3,10,1,1,3,4,0,1
2,43,2,3,7,1,1,3,4,0,1
3,42,3,2,9,1,1,3,3,0,1
4,36,3,3,8,1,1,3,2,0,1


In [5]:
df.isnull().any()

W_age                   False
W_education             False
H_education             False
ChildrenNum             False
W_religion              False
W_working               False
H_occupation            False
Living_index            False
Media_exposure          False
Contraceptive_method    False
dtype: bool
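
The data description lists `H_occupation` as categorical with codes 1 to 4 and no stated order, yet the notebook passes it to the models as a plain number. One common option, not used in this notebook, is one-hot encoding. The sketch below uses a small made-up frame (`demo` and its values are illustrative, not rows from `cmcdata.csv`):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from cmcdata.csv.
demo = pd.DataFrame({
    "W_age": [24, 45, 43, 42, 36],
    "H_occupation": [1, 2, 3, 4, 2],
})

# Replace the single coded column with one indicator column per category.
encoded = pd.get_dummies(demo, columns=["H_occupation"])
occupation_cols = [c for c in encoded.columns if c.startswith("H_occupation_")]
print(list(encoded.columns))
```

Each row then has exactly one active `H_occupation_*` indicator, so a model no longer treats occupation 4 as "four times" occupation 1.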

In [6]:
x = df.iloc[:,:9]
y = df['Contraceptive_method']

In [7]:
x = x.values
y = y.values

In [8]:
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state = 87)

## Method 1: Linear Regression

In [9]:
regr = LinearRegression()

In [10]:
regr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [11]:
y_pred1 = regr.predict(x_test)

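## R² on the test set is about 0.114, which is not satisfactory (note that `score` on a regressor returns R², not classification accuracy)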

In [12]:
regr.score(x_test,y_test)

0.11396234914025405
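
`LinearRegression.score` returns the coefficient of determination R², so the value above is not directly comparable to the accuracy reported for the classifier later on. The sketch below, on synthetic data (`X_demo` and `y_demo` are made up for illustration), shows that `score` agrees with `r2_score`, and one possible way to turn regression outputs into class labels by rounding and clipping to the label range 1 to 3. The rounding step is an assumption for illustration, not something this notebook does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))        # synthetic features
y_demo = rng.integers(1, 4, size=200)     # synthetic labels in {1, 2, 3}

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# score() on a regressor is R^2, identical to r2_score on the predictions.
r2_from_score = model.score(X_demo, y_demo)
r2_from_metric = r2_score(y_demo, pred)

# Illustrative conversion to class labels: round, then clip to the valid range.
labels = np.clip(np.rint(pred), 1, 3).astype(int)
acc = accuracy_score(y_demo, labels)
print(r2_from_score, r2_from_metric, acc)
```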

## Method 2: Statsmodels

In [13]:
result = sm.OLS(y_train,x_train).fit()

In [14]:
y_pred2 = result.predict(x_test)

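## R² on the test set is about 0.07, which is not satisfactory

`sm.OLS` does not add an intercept unless one is supplied, so this model is fit without a constant term. That is why the result differs from Method 1, and the R-squared of 0.838 in the summary below is the uncentered R² on the training data, not a value comparable to the test-set figure.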

In [15]:
r2_score(y_test, y_pred2)

0.06998416487662629

In [16]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.838
Model:                            OLS   Adj. R-squared:                  0.837
Method:                 Least Squares   F-statistic:                     674.1
Date:                Sat, 13 Apr 2019   Prob (F-statistic):               0.00
Time:                        14:35:22   Log-Likelihood:                -1471.2
No. Observations:                1178   AIC:                             2960.
Df Residuals:                    1169   BIC:                             3006.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0168      0.003     -5.170      0.0

## Method 3: SVM

In [17]:
clf = SVC()

In [18]:
clf.fit(x_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
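
The RBF kernel depends on distances between samples, so a feature with a wide range (such as age) can dominate features coded 0 to 4. A possible refinement, not applied in this notebook, is to standardize features before the SVM. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up) and is only an assumption about what might help, not a measured improvement:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (age-like and code-like).
X_demo = np.column_stack([
    rng.integers(16, 50, size=300),
    rng.integers(1, 5, size=300),
]).astype(float)
y_demo = rng.integers(1, 4, size=300)     # synthetic labels in {1, 2, 3}

# The scaler is fit on the training data only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scaled_svc.fit(X_demo, y_demo)
demo_pred = scaled_svc.predict(X_demo[:5])
print(demo_pred)
```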

In [19]:
y_pred3 = clf.predict(x_test)

## Accuracy is about 57%

In [20]:
accuracy_score(y_test,y_pred3)

0.5694915254237288

In [21]:
confusion_matrix(y_test,y_pred3)

array([[76, 11, 33],
       [18, 28, 19],
       [30, 16, 64]], dtype=int64)
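
The confusion matrix above has true classes in rows and predicted classes in columns (classes 1, 2, 3 in order). The following sketch recomputes the overall accuracy from it and derives per-class recall and precision, plus the share of the largest true class in the test set as a simple reference point:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[76, 11, 33],
               [18, 28, 19],
               [30, 16, 64]])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)         # per true class
precision = np.diag(cm) / cm.sum(axis=0)      # per predicted class
majority_share = cm.sum(axis=1).max() / cm.sum()  # always predicting the largest class

print(accuracy, recall, precision, majority_share)
```

Class 2 (long-term) has the lowest recall, about 0.43, so most of the errors come from long-term users being assigned to the other two classes.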

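## Because I suspect that religion and husband's occupation have little bearing on the contraceptive method, I remove these two features and run the SVM prediction again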

In [22]:
# Keep every feature except W_religion (column 4) and H_occupation (column 6)
x_reduce = df.iloc[:,[0,1,2,3,5,7,8]]

In [23]:
x_reduce = x_reduce.values

In [24]:
x_reduce_train,x_reduce_test,y_reduce_train,y_reduce_test = train_test_split(x_reduce, y, test_size=0.2, random_state = 87)

In [25]:
clf.fit(x_reduce_train,y_reduce_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## Accuracy is 60%, about 3 percentage points higher than when all features are used

In [26]:
accuracy_score(y_reduce_test,clf.predict(x_reduce_test))

0.6

In [27]:
confusion_matrix(y_reduce_test,clf.predict(x_reduce_test))

array([[80,  6, 34],
       [15, 26, 24],
       [30,  9, 71]], dtype=int64)
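
Both accuracies come from a single 80/20 split of the data, so a difference of about 3 percentage points could partly reflect which rows landed in the test set. One way to check is k-fold cross-validation on both feature sets. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up; `keep` mirrors the column positions used for `x_reduce`) and only illustrates the procedure:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 9))        # synthetic stand-in for the 9 features
y_demo = rng.integers(1, 4, size=300)     # synthetic labels in {1, 2, 3}

keep = [0, 1, 2, 3, 5, 7, 8]              # same positions as x_reduce
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=87)

# One accuracy per fold for each feature set, using identical folds.
scores_all = cross_val_score(SVC(), X_demo, y_demo, cv=cv)
scores_reduced = cross_val_score(SVC(), X_demo[:, keep], y_demo, cv=cv)

print(scores_all.mean(), scores_all.std())
print(scores_reduced.mean(), scores_reduced.std())
```

If the gap between the two means is small relative to the fold-to-fold spread, the evidence for dropping the two features is weak.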

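## Conclusion: these results suggest that a woman's choice of contraceptive method is more closely related to her age, education, and employment, her husband's education, the number of children, the standard-of-living index, and media exposure.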