# 期中專案
* 資料來源：https://www.kaggle.com/fivethirtyeight/fivethirtyeight-comic-characters-dataset
* 資料敘述：美國漫畫數據集，蒐集了漫畫裡眾多人物的資料，包含身分、性別、角色好壞、是否存活、眼睛顏色、頭髮顏色、出現年份、出現次數等等。
## 題目1：用身分、性別、角色好壞、是否存活、眼睛顏色、頭髮顏色、出現年份這些因素來預測該人物在漫畫裡的出現次數。

In [76]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib as mpl
mpl.rc('font', family='Noto Sans CJK TC')
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [77]:
df = pd.read_csv("dc-wikia-data.csv")

In [78]:
df.head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0


In [79]:
df.isnull().any()

page_id             False
name                False
urlslug             False
ID                   True
ALIGN                True
EYE                  True
HAIR                 True
SEX                  True
GSM                  True
ALIVE                True
APPEARANCES          True
FIRST APPEARANCE     True
YEAR                 True
dtype: bool

In [80]:
df2=df

## 拿掉沒有要使用的資料

In [81]:
df2.drop('page_id', axis=1, inplace=True)
df2.drop('name', axis=1, inplace=True)
df2.drop('urlslug', axis=1, inplace=True)
df2.drop('GSM', axis=1, inplace=True)
df2.drop('FIRST APPEARANCE', axis=1, inplace=True)

## 由於身分別和眼睛頭髮顏色資料缺失較多，故手動補上。
### * 身分別:由於在表格裡本就有Identity Unknown，故將空白的部分全部填上身分未知
### * 眼睛頭髮顏色：按照正常人多為黑色，故將眼睛頭髮空白處補上黑色

In [82]:
df2['ID'].fillna('Identity Unknown', inplace=True)
df2['EYE'].fillna('Black Eyes', inplace=True)
df2['HAIR'].fillna('Black Hair', inplace=True)

## 拿掉剩餘有缺失的資料

In [83]:
df3 = df2.dropna()

In [84]:
df3.head()

Unnamed: 0,ID,ALIGN,EYE,HAIR,SEX,ALIVE,APPEARANCES,YEAR
0,Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,Living Characters,3093.0,1939.0
1,Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,Living Characters,2496.0,1986.0
2,Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,Living Characters,1565.0,1959.0
3,Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,Living Characters,1316.0,1987.0
4,Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,Living Characters,1237.0,1940.0


## 由於該份資料多為名目變數，故額外將其轉換為-1、0、1值，方便接下來進行預測
## 顏色的部分則根據RGB值將其轉換為三個變數

In [85]:
df3['ID Value'] = df3['ID'].map({'Secret Identity':-1, 'Public Identity':1, 'Identity Unknown':0})
df3['ALIGN Value'] = df3['ALIGN'].map({'Good Characters':1, 'Bad Characters':-1, 'Neutral Characters':0})
df3['SEX Value'] = df3['SEX'].map({'Male Characters':1, 'Female Characters':-1, 'Genderless Characters':0})
df3['ALIVE Value'] = df3['ALIVE'].map({'Living Characters':1, 'Deceased Characters':0, 'Reformed Criminals':-1})
df3['EYE_R Value'] = df3['EYE'].map({'Red Eyes':255, 'Green Eyes':0, 'Blue Eyes':0, 'Brown Eyes':165, 'Black Eyes':0, 'White Eyes':255, 'Grey Eyes':128, 'Yellow Eyes':255, 'Purple Eyes':128, 'Amber Eyes':255, 'Hazel Eyes':218, 'Photocellular Eyes':202, 'Pink Eyes':255, 'Gold Eyes':255, 'Orange Eyes':255, 'Violet Eyes':139, 'Auburn Hair':165})
df3['EYE_G Value'] = df3['EYE'].map({'Red Eyes':0, 'Green Eyes':255, 'Blue Eyes':0, 'Brown Eyes':42, 'Black Eyes':0, 'White Eyes':255, 'Grey Eyes':128, 'Yellow Eyes':255, 'Purple Eyes':0, 'Amber Eyes':191, 'Hazel Eyes':145, 'Photocellular Eyes':133, 'Pink Eyes':192, 'Gold Eyes':215, 'Orange Eyes':165, 'Violet Eyes':0, 'Auburn Hair':42})
df3['EYE_B Value'] = df3['EYE'].map({'Red Eyes':0, 'Green Eyes':0, 'Blue Eyes':255, 'Brown Eyes':42, 'Black Eyes':0, 'White Eyes':255, 'Grey Eyes':128, 'Yellow Eyes':0, 'Purple Eyes':128, 'Amber Eyes':0, 'Hazel Eyes':0, 'Photocellular Eyes':106, 'Pink Eyes':203, 'Gold Eyes':0, 'Orange Eyes':0, 'Violet Eyes':255, 'Auburn Hair':42})
df3['HAIR_R Value'] = df3['HAIR'].map({'Red Hair':255, 'Green Hair':0, 'Blue Hair':0, 'Brown Hair':165, 'Black Hair':0, 'White Hair':255, 'Grey Hair':128, 'Blond Hair':98, 'Purple Hair':128, 'Pink Hair':255, 'Gold Hair':255, 'Orange Hair':255, 'Violet Hair':139, 'Strawberry Blond Hair':247, 'Silver Hair':192, 'Reddish Brown Hair':89, 'Platinum Blond Hair':252})
df3['HAIR_G Value'] = df3['HAIR'].map({'Red Hair':0, 'Green Hair':255, 'Blue Hair':0, 'Brown Hair':42, 'Black Hair':0, 'White Hair':255, 'Grey Hair':128, 'Blond Hair':94, 'Purple Hair':0, 'Pink Hair':192, 'Gold Hair':215, 'Orange Hair':165, 'Violet Hair':0, 'Strawberry Blond Hair':232, 'Silver Hair':192, 'Reddish Brown Hair':0, 'Platinum Blond Hair':255})
df3['HAIR_B Value'] = df3['HAIR'].map({'Red Hair':0, 'Green Hair':0, 'Blue Hair':255, 'Brown Hair':42, 'Black Hair':0, 'White Hair':255, 'Grey Hair':128, 'Blond Hair':75, 'Purple Hair':128, 'Pink Hair':203, 'Gold Hair':0, 'Orange Hair':0, 'Violet Hair':255, 'Strawberry Blond Hair':212, 'Silver Hair':192, 'Reddish Brown Hair':0, 'Platinum Blond Hair':227})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See

## 拿掉轉換後尚有缺失的值

In [86]:
df3 = df3.dropna()

In [87]:
df3.isnull().any()

ID              False
ALIGN           False
EYE             False
HAIR            False
SEX             False
ALIVE           False
APPEARANCES     False
YEAR            False
ID Value        False
ALIGN Value     False
SEX Value       False
ALIVE Value     False
EYE_R Value     False
EYE_G Value     False
EYE_B Value     False
HAIR_R Value    False
HAIR_G Value    False
HAIR_B Value    False
dtype: bool

In [88]:
df3.head()

Unnamed: 0,ID,ALIGN,EYE,HAIR,SEX,ALIVE,APPEARANCES,YEAR,ID Value,ALIGN Value,SEX Value,ALIVE Value,EYE_R Value,EYE_G Value,EYE_B Value,HAIR_R Value,HAIR_G Value,HAIR_B Value
0,Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,Living Characters,3093.0,1939.0,-1,1.0,1.0,1,0,0,255,0,0,0
1,Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,Living Characters,2496.0,1986.0,-1,1.0,1.0,1,0,0,255,0,0,0
2,Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,Living Characters,1565.0,1959.0,-1,1.0,1.0,1,165,42,42,165,42,42
3,Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,Living Characters,1316.0,1987.0,1,1.0,1.0,1,165,42,42,255,255,255
4,Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,Living Characters,1237.0,1940.0,-1,1.0,1.0,1,0,0,255,0,0,0


## 取出要預測的因素行，並轉成array，進行迴歸模型計算

In [89]:
x = df3.iloc[:,7:17]
y = df3['APPEARANCES']

In [90]:
x = x.values
y = y.values

In [91]:
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state = 411)

In [92]:
regr = LinearRegression()

In [93]:
regr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [94]:
y_pred = regr.predict(x_test)

In [95]:
y_pred

array([ 5.35793849, -2.49852731, 14.28194547, ..., 16.47937482,
       36.86165086,  6.59315027])

In [96]:
y_test

array([11., 16.,  1., ...,  1.,  6.,  6.])

## 不幸的，正確率只有大約15.2%QQ，表示這些因素之間應該無太大關聯

In [97]:
regr.score(x_test,y_test)

0.1518291197612136

In [98]:
result = sm.OLS(y_train,x_train).fit()

In [99]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.110
Model:                            OLS   Adj. R-squared:                  0.108
Method:                 Least Squares   F-statistic:                     57.70
Date:                Fri, 12 Apr 2019   Prob (F-statistic):          1.41e-110
Time:                        13:59:27   Log-Likelihood:                -27671.
No. Observations:                4662   AIC:                         5.536e+04
Df Residuals:                    4652   BIC:                         5.543e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0039      0.002      2.451      0.0