<a href="https://colab.research.google.com/github/Khaled-Abdelhamid/Death-Big-data-Analytics/blob/Ahmed/big_trend_fitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I fit a regression model to the monthly number of deaths from 2005 to 2014 and use it to predict the number of deaths in 2015.

The trend is broken down by the resident status, race, gender, and cause of death of the deceased.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import numpy as np 
import pandas as pd 
import time
import json
import gc
import xgboost as xgb 
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

In [5]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
# os.environ["SPARK_HOME"] ="/content/drive/MyDrive/Colab Notebooks/BigData/spark-2.4.7-bin-hadoop2.7"


In [7]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [58]:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType, StringType
from pyspark.sql.functions import udf,col


In [9]:
start = time.time()
data_path="/content/drive/MyDrive/Colab Notebooks/BigData/Final project/Death-Big-data-Analytics/archive"
df=spark.read.options(header=True,inferSchema=True).csv(data_path)
df.show(truncate=False)
print((time.time()-start)/60)


In [16]:
df.dtypes

[('resident_status', 'string'),
 ('education_1989_revision', 'string'),
 ('education_2003_revision', 'string'),
 ('education_reporting_flag', 'string'),
 ('month_of_death', 'string'),
 ('sex', 'string'),
 ('detail_age_type', 'string'),
 ('detail_age', 'string'),
 ('age_substitution_flag', 'string'),
 ('age_recode_52', 'string'),
 ('age_recode_27', 'string'),
 ('age_recode_12', 'string'),
 ('infant_age_recode_22', 'string'),
 ('place_of_death_and_decedents_status', 'string'),
 ('marital_status', 'string'),
 ('day_of_week_of_death', 'string'),
 ('current_data_year', 'string'),
 ('injury_at_work', 'string'),
 ('manner_of_death', 'string'),
 ('method_of_disposition', 'string'),
 ('autopsy', 'string'),
 ('activity_code', 'string'),
 ('place_of_injury_for_causes_w00_y34_except_y06_and_y07_', 'string'),
 ('icd_code_10th_revision', 'string'),
 ('358_cause_recode', 'string'),
 ('113_cause_recode', 'string'),
 ('130_infant_cause_recode', 'string'),
 ('39_cause_recode', 'string'),
 ('number_of_en

# Data Preprocessing and Feature Engineering

In [212]:
df=df.filter("current_data_year in ('2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015')")\
               .filter('sex in ("M","F")')


In [213]:
features = ['current_data_year','month_of_death','resident_status','sex','race_recode_3','39_cause_recode']
train_1=df.select(features)

# df2.groupby("current_data_year").count().show()


In [29]:
train_1.groupby("race_recode_3").count().show()

+-------------+--------+
|race_recode_3|   count|
+-------------+--------+
|            3| 3258257|
|            1|23703335|
|            2|  759081|
+-------------+--------+



In [28]:
train_1.groupby("39_cause_recode").count().show()

+---------------+-------+
|39_cause_recode|  count|
+---------------+-------+
|             07| 405728|
|             15|1630010|
|             11| 311877|
|             29|  33924|
|             42|  59467|
|             30| 361128|
|             34| 110071|
|             01|   6169|
|             22|2060821|
|             28|1538728|
|             16| 807571|
|             35|  21919|
|             31| 520221|
|             27| 606709|
|             17| 921265|
|             26| 222159|
|             09| 455641|
|             05| 124569|
|             41| 189552|
|             23| 301297|
+---------------+-------+
only showing top 20 rows



In [27]:
train_1.groupby("resident_status").count().show()

+---------------+--------+
|resident_status|   count|
+---------------+--------+
|              3|  905032|
|              1|22449300|
|              4|   50320|
|              2| 4316021|
+---------------+--------+



In [22]:
train_1.groupby("current_data_year").count().show()

+-----------------+-------+
|current_data_year|  count|
+-----------------+-------+
|             2012|2547864|
|             2014|2631171|
|             2013|2601452|
|             2005|2452506|
|             2009|2441219|
|             2006|2430725|
|             2011|2519842|
|             2008|2476811|
|             2007|2428343|
|             2015|2718198|
|             2010|2472542|
+-----------------+-------+



In [26]:
train_1.groupby("month_of_death").count().show()

+--------------+-------+
|month_of_death|  count|
+--------------+-------+
|            07|2205194|
|            11|2285476|
|            01|2571615|
|            09|2146811|
|            05|2278865|
|            08|2191365|
|            03|2494294|
|            02|2324679|
|            06|2154291|
|            10|2292374|
|            12|2479590|
|            04|2296119|
+--------------+-------+



In [46]:
def read_json(path):
    with open(path,'r', encoding = 'utf-8') as f:
        definition = json.load(f)
    return definition

In [47]:
path="/content/drive/MyDrive/Colab Notebooks/BigData/Final project/Death-Big-data-Analytics/archive"
definition = read_json(path+'/2005_codes.json')
definition['place_of_death_and_decedents_status']['1'] = 'Hospital, Clinic or Medical Center'

The assignment above fixes a typo in the JSON code file.

In [304]:
cause_definition = {}
neoplasm = list(range(4,16))
heart = list(range(19,23))

cause_dict = {'accident':1, 'heart':2, 'neoplasm':3,'others':4}

for i in range(1,43):
    if i in neoplasm:
        cause_definition[i] = cause_dict['neoplasm'] # code for neoplasmic deaths
    elif i in heart:
        cause_definition[i] = cause_dict['heart'] # code for heart related deaths
    elif i in [38,39]:
        cause_definition[i] = cause_dict['accident'] # code for accident related deaths
    else:
        cause_definition[i] = cause_dict['others'] # code for other deaths

In [305]:
cause_definition

{1: 4,
 2: 4,
 3: 4,
 4: 3,
 5: 3,
 6: 3,
 7: 3,
 8: 3,
 9: 3,
 10: 3,
 11: 3,
 12: 3,
 13: 3,
 14: 3,
 15: 3,
 16: 4,
 17: 4,
 18: 4,
 19: 2,
 20: 2,
 21: 2,
 22: 2,
 23: 4,
 24: 4,
 25: 4,
 26: 4,
 27: 4,
 28: 4,
 29: 4,
 30: 4,
 31: 4,
 32: 4,
 33: 4,
 34: 4,
 35: 4,
 36: 4,
 37: 4,
 38: 1,
 39: 1,
 40: 4,
 41: 4,
 42: 4}

In [306]:
features = ['current_data_year','month_of_death','resident_status','sex','race_recode_3','39_cause_recode']
train_1=df.select(features)


In [307]:
@udf(returnType=IntegerType()) 
def convertCause(val):
  return cause_definition[int(val)]

@udf(returnType=IntegerType()) 
def convertResidency(val):
  return 1 if val=="1" or val=="4" else 0

@udf(returnType=IntegerType()) 
def convertSex(val):
  return 1 if val=="M"  else 0

In [308]:
train_1=train_1.withColumn('cause_recode',convertCause(col('39_cause_recode')))\
               .withColumn('resident_status',convertResidency(col('resident_status')))\
               .withColumn('sex',convertSex(col('sex')))\
               .withColumn("current_data_year",col("current_data_year").cast("Integer"))\
               .withColumn("month_of_death",col("month_of_death").cast("Integer"))\
               .withColumn("race_recode",col("race_recode_3").cast("Integer"))\
               .withColumn("numeric_month",col("month_of_death")+(col("current_data_year")-2005)*12)\
               .filter('cause_recode != 4') # remove all of death samples with cause code (others)

In [309]:
train_1.show()

+-----------------+--------------+---------------+---+-------------+---------------+------------+-----------+-------------+
|current_data_year|month_of_death|resident_status|sex|race_recode_3|39_cause_recode|cause_recode|race_recode|numeric_month|
+-----------------+--------------+---------------+---+-------------+---------------+------------+-----------+-------------+
|             2006|             1|              1|  0|            1|             22|           2|          1|           13|
|             2006|             1|              1|  0|            1|             21|           2|          1|           13|
|             2006|             1|              1|  0|            3|             09|           3|          3|           13|
|             2006|             1|              1|  1|            1|             38|           1|          1|           13|
|             2006|             1|              1|  1|            1|             22|           2|          1|           13|
|       

In [310]:
train_1.dtypes

[('current_data_year', 'int'),
 ('month_of_death', 'int'),
 ('resident_status', 'int'),
 ('sex', 'int'),
 ('race_recode_3', 'string'),
 ('39_cause_recode', 'string'),
 ('cause_recode', 'int'),
 ('race_recode', 'int'),
 ('numeric_month', 'int')]

In [311]:
cols=['numeric_month','resident_status','sex','race_recode','cause_recode']
death_cnt_group= train_1.select(cols)\
                        .groupby('numeric_month','resident_status','sex','race_recode','cause_recode')\
                        .count()

In [313]:
cols=['numeric_month','resident_status','count']
death_cnt_group2 = death_cnt_group.select(cols).groupby('numeric_month','resident_status').agg(F.mean('count')).withColumnRenamed("avg(count)","resident_status_death_mean")
death_cnt_group=death_cnt_group.join(death_cnt_group2,['numeric_month','resident_status'] ,how='left_outer')

In [314]:
cols=['numeric_month','sex','count']
death_cnt_group3 = death_cnt_group.select(cols).groupby('numeric_month','sex').agg(F.mean('count')).withColumnRenamed("avg(count)","sex_death_mean")
death_cnt_group=death_cnt_group.join(death_cnt_group3, ['numeric_month','sex'],how='left')

In [315]:
cols=['numeric_month','race_recode','count']
death_cnt_group4 = death_cnt_group.select(cols).groupby('numeric_month','race_recode').agg(F.mean('count')).withColumnRenamed("avg(count)","race_recode_death_mean")
death_cnt_group=death_cnt_group.join(death_cnt_group4, ['numeric_month','race_recode'],how='left')

In [316]:
cols=['numeric_month','cause_recode','count']
death_cnt_group5 = death_cnt_group.select(cols).groupby('numeric_month','cause_recode').agg(F.mean('count')).withColumnRenamed("avg(count)","cause_recode_death_mean")
death_cnt_group=death_cnt_group.join(death_cnt_group5,['numeric_month','cause_recode'],how='left')

In [317]:
death_cnt_group.dtypes

[('numeric_month', 'int'),
 ('cause_recode', 'int'),
 ('race_recode', 'int'),
 ('sex', 'int'),
 ('resident_status', 'int'),
 ('count', 'bigint'),
 ('resident_status_death_mean', 'double'),
 ('sex_death_mean', 'double'),
 ('race_recode_death_mean', 'double'),
 ('cause_recode_death_mean', 'double')]

In [324]:
col_lags=['count','resident_status_death_mean','sex_death_mean','race_recode_death_mean']
for column in col_lags:
  cols=['numeric_month','resident_status','sex','race_recode','cause_recode',column]
  for lag in [1,6,12]:
    tmp = death_cnt_group.select(cols)
    shifted=tmp.withColumn("numeric_month",col("numeric_month")+lag)\
               .withColumnRenamed(column,f"{column}_lag_{lag}")
    death_cnt_group = death_cnt_group.join(shifted,cols[:-1],how='left')


In [325]:
death_cnt_group.dtypes

[('numeric_month', 'int'),
 ('resident_status', 'int'),
 ('sex', 'int'),
 ('race_recode', 'int'),
 ('cause_recode', 'int'),
 ('count', 'bigint'),
 ('resident_status_death_mean', 'double'),
 ('sex_death_mean', 'double'),
 ('race_recode_death_mean', 'double'),
 ('cause_recode_death_mean', 'double'),
 ('count_lag_1', 'bigint'),
 ('count_lag_6', 'bigint'),
 ('count_lag_12', 'bigint'),
 ('resident_status_death_mean_lag_1', 'double'),
 ('resident_status_death_mean_lag_6', 'double'),
 ('resident_status_death_mean_lag_12', 'double'),
 ('sex_death_mean_lag_1', 'double'),
 ('sex_death_mean_lag_6', 'double'),
 ('sex_death_mean_lag_12', 'double'),
 ('race_recode_death_mean_lag_1', 'double'),
 ('race_recode_death_mean_lag_6', 'double'),
 ('race_recode_death_mean_lag_12', 'double')]

In [None]:
death_cnt_group.show()

In [None]:
death_cnt_group=death_cnt_group.fillna(0)

## Generate average count and Create time lag features

The lag features are null wherever a group has no record in the earlier month, so I fill them with 0.
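As a plain-Python illustration of the shift-and-join idea behind the lag columns (a minimal sketch, not the Spark code itself; the function name `add_lags` and the toy counts are made up for this example): the lag-k feature of a group in month m is the value the same group had in month m - k, and a missing earlier month becomes 0.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[int, str]  # (numeric_month, group label); both are illustrative


def add_lags(counts: Dict[Key, int],
             lags: Iterable[int] = (1, 6, 12)) -> Dict[Key, Dict[str, int]]:
    """Return, for each (month, group), its count plus one count_lag_k per lag.

    A lag with no earlier record is filled with 0, mirroring fillna(0).
    """
    lags = tuple(lags)
    features = {}
    for (month, group), count in counts.items():
        row = {"count": count}
        for lag in lags:
            # The lag-k value is the same group's count k months earlier.
            row[f"count_lag_{lag}"] = counts.get((month - lag, group), 0)
        features[(month, group)] = row
    return features


# Toy data (hypothetical): one group observed in months 1, 2, 7 and 13.
toy_counts = {(1, "a"): 10, (2, "a"): 12, (7, "a"): 15, (13, "a"): 20}
lagged = add_lags(toy_counts)
print(lagged[(13, "a")])
```

Note that filling with 0 makes "no history" look the same as "zero deaths". Since XGBoost can handle missing values natively, leaving the nulls in place is an alternative worth comparing.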

In [None]:
onehot_race = pd.get_dummies(death_cnt_group.race_recode_3, prefix='race')
death_cnt_group = pd.concat([death_cnt_group, onehot_race],axis=1)

In [None]:
onehot_cause = pd.get_dummies(death_cnt_group.reconstr_cause_definition_1, prefix='cause')
death_cnt_group = pd.concat([death_cnt_group, onehot_cause],axis=1)

In [None]:
death_cnt_group.columns

In [None]:
path='/content/drive/MyDrive/Colab Notebooks/BigData/Final project'
death_cnt_group.to_pickle(path+'/Training_Regression.pkl')

# Modeling and Fine Tuning

In [None]:
death_cnt_group = pd.read_pickle(path + '/Training_Regression.pkl')

In [None]:
death_cnt_group.columns

In [None]:
death_cnt_group.info()

In [None]:
train_columns = ['numeric_month', 'is_resident', 'sex', 'race_1','race_2','cause_1','cause_2',
       'helper_lag_1', 'helper_lag_6', 'helper_lag_12',
       'month_avg_resident_death_lag_1', 'month_avg_resident_death_lag_6','month_avg_resident_death_lag_12', 
       'month_avg_sex_death_lag_1','month_avg_sex_death_lag_6', 'month_avg_sex_death_lag_12',
       'month_avg_race_death_lag_1', 'month_avg_race_death_lag_6','month_avg_race_death_lag_12', 
       'month_avg_cause_death_lag_1','month_avg_cause_death_lag_6', 'month_avg_cause_death_lag_12']

In [None]:
train = death_cnt_group[death_cnt_group.numeric_month<132]
train_X = train[train_columns]
train_y = train['helper']
test = death_cnt_group[death_cnt_group.numeric_month==132]
test_X = test[train_columns]
test_y = test['helper']

In [None]:
train_X.columns

In [None]:
cv_params = {'n_estimators':[300,400,500], 'learning_rate':np.arange(0.3,0.51,0.01)}
xgb_model = xgb.XGBRegressor(max_depth=8, subsample=0.8, colsample_bytree=0.8, colsample_bynode=0.8,seed=42)
cv1 = GridSearchCV(estimator=xgb_model, param_grid=cv_params, scoring='neg_root_mean_squared_error', cv=5, verbose=1)
cv1.fit(train_X, train_y)

In [None]:
cv1.best_params_, cv1.best_score_

## Final model

In [None]:
cv_xgb_model = xgb.XGBRegressor(n_estimators=500,learning_rate=0.5,max_depth=8,
                            subsample=0.8, colsample_bytree=0.8, colsample_bynode=0.8,colsample_bylevel=1,
                            gamma=100, min_child_weight=1, reg_lambda=1, seed=42)
cv_xgb_model.fit(train_X,train_y, eval_set=[(train_X,train_y),(test_X,test_y)],eval_metric='rmse',early_stopping_rounds=10)

In [None]:
cv_xgb_model.get_booster().get_score(importance_type='gain')

In [None]:
fig,ax = plt.subplots(figsize=(12,10))
xgb.plot_importance(cv_xgb_model,ax=ax,importance_type='gain')
plt.yticks(fontsize=14)

In [None]:
cv_xgb_model.save_model('best_model_1.json')

# Modeling result visualization

In this part, I plot the fitted values against the true counts for each feature, using two kinds of plot: a marker plot and a line plot.
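Every plot in this part is built from the same aggregation: keep the rows for one feature value, then sum the true and predicted counts per month. The sketch below shows that step in plain Python on made-up rows. The helper name `monthly_totals` and the toy values are assumptions for illustration; the column names `helper` and `pred_cnt` are the ones used in the cells of this part.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def monthly_totals(rows: List[Dict[str, float]], feature: str, value: float,
                   true_col: str = "helper",
                   pred_col: str = "pred_cnt") -> List[Tuple[int, float, float]]:
    """Sum true and predicted counts per numeric_month for one feature value."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        if row[feature] == value:
            month_totals = totals[row["numeric_month"]]
            month_totals[0] += row[true_col]
            month_totals[1] += row[pred_col]
    # Sort by month so the result can be plotted directly as a time series.
    return [(month, true, pred) for month, (true, pred) in sorted(totals.items())]


# Toy rows (hypothetical values), one row per group and month.
rows = [
    {"numeric_month": 1, "sex": 0, "helper": 5, "pred_cnt": 4.5},
    {"numeric_month": 1, "sex": 0, "helper": 3, "pred_cnt": 3.5},
    {"numeric_month": 1, "sex": 1, "helper": 7, "pred_cnt": 7.0},
    {"numeric_month": 2, "sex": 0, "helper": 6, "pred_cnt": 5.0},
]
female = monthly_totals(rows, "sex", 0)
print(female)
```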

In [None]:
pred_cnt = cv_xgb_model.predict(death_cnt_group[train_columns])

In [None]:
death_cnt_group['pred_cnt'] = pred_cnt

In [None]:
death_cnt_group.sample(2)

## Trend of different resident status

In [None]:
death_cnt_resident_0 = death_cnt_group[death_cnt_group.is_resident==0]
grouped_resident_0 = death_cnt_resident_0[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_resident_0.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title('pred VS true for non-resident trend',fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_resident_0['numeric_month'],grouped_resident_0['helper'],color='red',label='true')
ax.plot(grouped_resident_0['numeric_month'],grouped_resident_0['pred_cnt'],color='blue',label='pred')
ax.set_title('pred VS true for non-resident trend',fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left',labels=['true','pred'])

The trend from month 40 to 47 (April to November 2008) is not captured well by the model, and the values are not predicted very accurately. A probable reason is that the CDC did not record all non-resident deaths, so the data contain stochastic noise.
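The x-axis of these plots is the `numeric_month` index, defined earlier as `month_of_death + (current_data_year - 2005) * 12`. A minimal sketch of the conversion in both directions, useful when reading month numbers off a plot (the function names are made up for this example):

```python
from typing import Tuple

BASE_YEAR = 2005  # first year in the data, as in the numeric_month definition


def to_numeric_month(year: int, month: int) -> int:
    """Same formula as the numeric_month column: January 2005 maps to 1."""
    return month + (year - BASE_YEAR) * 12


def from_numeric_month(numeric_month: int) -> Tuple[int, int]:
    """Invert the formula and return (year, month) with month in 1..12."""
    year_offset, month_index = divmod(numeric_month - 1, 12)
    return BASE_YEAR + year_offset, month_index + 1


print(from_numeric_month(13))   # the rows shown by train_1.show() are January 2006
print(from_numeric_month(132))  # the last month in the data
```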

In [None]:
death_cnt_resident_1 = death_cnt_group[death_cnt_group.is_resident==1]
grouped_resident_1 = death_cnt_resident_1[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_resident_1.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title('pred VS true for resident trend',fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_resident_1['numeric_month'],grouped_resident_1['helper'],color='red',label='true')
ax.plot(grouped_resident_1['numeric_month'],grouped_resident_1['pred_cnt'],color='blue',label='pred')
ax.set_title('pred VS true for resident trend',fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left',labels=['true','pred'])

The model performs quite well in capturing the resident death trend and value.

## Trend of different causes

In [None]:
death_cnt_cause_1 = death_cnt_group[death_cnt_group.reconstr_cause_definition_1==1]
grouped_cause_1 = death_cnt_cause_1[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_cause_1.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title("pred VS true for 'cause:accident' trend",fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_cause_1['numeric_month'],grouped_cause_1['helper'],color='red',label='true')
ax.plot(grouped_cause_1['numeric_month'],grouped_cause_1['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for 'cause:accident' trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month', fontsize=14)
ax.legend(loc='upper left')

The model roughly captures the trend, but the actual values are not predicted well. A likely reason is that accidents are inherently unexpected events, so their counts are hard to predict with machine learning models.

In [None]:
death_cnt_cause_2 = death_cnt_group[death_cnt_group.reconstr_cause_definition_1==2]
grouped_cause_2 = death_cnt_cause_2[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_cause_2.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title("pred VS true for 'cause:heart' trend",fontsize=16)
plt.legend(loc='upper center',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_cause_2['numeric_month'],grouped_cause_2['helper'],color='red',label='true')
ax.plot(grouped_cause_2['numeric_month'],grouped_cause_2['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for 'cause:heart' trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month', fontsize=14)
ax.legend(loc='upper center')

The model performs quite well in capturing the heart death trend and value.

In [None]:
death_cnt_cause_3 = death_cnt_group[death_cnt_group.reconstr_cause_definition_1==3]
grouped_cause_3 = death_cnt_cause_3[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_cause_3.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title("pred VS true for 'cause:neoplasm' trend",fontsize=16)
plt.xlabel('numeric month',fontsize=14)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_cause_3['numeric_month'],grouped_cause_3['helper'],color='red',label='true')
ax.plot(grouped_cause_3['numeric_month'],grouped_cause_3['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for 'cause:neoplasm' trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left')

The model captures the neoplasm death trend and values relatively well compared with the accident trend.

## Trend of different genders

In [None]:
death_cnt_sex_0 = death_cnt_group[death_cnt_group.sex==0]
grouped_sex_0 = death_cnt_sex_0[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_sex_0.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title('pred VS true for female trend',fontsize=16)
plt.legend(loc='upper center',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_sex_0['numeric_month'],grouped_sex_0['helper'],color='red',label='true')
ax.plot(grouped_sex_0['numeric_month'],grouped_sex_0['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for female trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper center')

The model performs quite well in capturing the female death trend and predicting value.

In [None]:
death_cnt_sex_1 = death_cnt_group[death_cnt_group.sex==1]
grouped_sex_1 = death_cnt_sex_1[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_sex_1.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title('pred VS true for male trend',fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_sex_1['numeric_month'],grouped_sex_1['helper'],color='red',label='true')
ax.plot(grouped_sex_1['numeric_month'],grouped_sex_1['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for male trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left')

The model performs well in capturing the male death trend and predicting its values.

## Trend of different races

In [None]:
death_cnt_race_1 = death_cnt_group[death_cnt_group.race_recode_3==1]
grouped_race_1 = death_cnt_race_1[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_race_1.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title("pred VS true for 'race:white' trend",fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_race_1['numeric_month'],grouped_race_1['helper'],color='red',label='true')
ax.plot(grouped_race_1['numeric_month'],grouped_race_1['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for 'race:white' trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left')

The model performs well in capturing the death trend of white people and predicting its values.

In [None]:
death_cnt_race_2 = death_cnt_group[death_cnt_group.race_recode_3==2]
grouped_race_2 = death_cnt_race_2[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_race_2.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title("pred VS true for 'race:others' trend",fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_race_2['numeric_month'],grouped_race_2['helper'],color='red',label='true')
ax.plot(grouped_race_2['numeric_month'],grouped_race_2['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for 'race:others' trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left')

The trend is not captured well by the model. A probable reason is stochastic noise in the data caused by flaws in the CDC's records. The figure shows that the number of deaths is very small in each year, and the CDC may not have recorded all the deaths of people of other races.

In [None]:
death_cnt_race_3 = death_cnt_group[death_cnt_group.race_recode_3==3]
grouped_race_3 = death_cnt_race_3[['numeric_month','helper','pred_cnt']].groupby('numeric_month').agg('sum').reset_index()
grouped_race_3.plot(x='numeric_month',y=['helper','pred_cnt'],figsize=(10,5), grid=True, style=['rx','o'])
plt.title("pred VS true for 'race:black' trend",fontsize=16)
plt.legend(loc='upper left',labels=['true','pred'])
plt.xticks(np.arange(1,134,11))
plt.xlabel('numeric month',fontsize=14)
plt.show()

In [None]:
fig,ax = plt.subplots(figsize=(10,5))
ax.plot(grouped_race_3['numeric_month'],grouped_race_3['helper'],color='red',label='true')
ax.plot(grouped_race_3['numeric_month'],grouped_race_3['pred_cnt'],color='blue',label='pred')
ax.set_title("pred VS true for 'race:black' trend",fontsize=16)
ax.set_xticks(np.arange(1,134,11))
ax.set_xlabel('numeric month',fontsize=14)
ax.legend(loc='upper left')

The trend from month 5 to 12 is not captured well, and the values are not predicted well by the model. The model performs better here than for the deceased of other races, but worse than for white people.
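The comparisons between groups in this part are made by eye. One way to back them with numbers would be to compute an error metric per aggregated series. The sketch below is an assumption about how that could be done, not something the notebook runs: `rmse` and `mape` are hypothetical helper names and the counts are toy values.

```python
import math
from typing import Sequence


def rmse(true: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error; expressed in the same unit as the counts."""
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("true and pred must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mape(true: Sequence[float], pred: Sequence[float]) -> float:
    """Mean absolute percentage error, skipping months whose true count is 0."""
    pairs = [(t, p) for t, p in zip(true, pred) if t != 0]
    if not pairs:
        raise ValueError("no month with a non-zero true count")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


# Toy monthly totals (hypothetical) for one group.
true_counts = [100.0, 200.0, 400.0]
pred_counts = [110.0, 180.0, 400.0]
series_rmse = rmse(true_counts, pred_counts)
series_mape = mape(true_counts, pred_counts)
print(round(series_rmse, 2), round(series_mape, 2))
```

RMSE grows with the size of the group, so a relative metric such as MAPE is probably the fairer choice when comparing a large group (white) with a small one (other races).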