# KDD Assignment 2
![CS306](https://img.shields.io/badge/CS306-Data%20Mining-orange) &nbsp;
![2022s](https://img.shields.io/badge/semester-2022%20spring-blue)

Author: 何泽安 (He Zean) &nbsp;&nbsp; SID: 12011323

## Part 1. Data Collection

We first inspect the page's network traffic to find the API URL that serves the data. Only the `chinaDayList` and `chinaDayAddList` modules are needed, so we restrict the `modules` query parameter to those two, as in the code below.

<img src="https://i.imgur.com/JcNXIGi.png" alt="api capture" style="zoom:25%;" />
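
The request URL described above can also be assembled from its parts, which makes the choice of modules explicit. This is a minimal sketch: `BASE_URL` and `build_api_url` are hypothetical names introduced for illustration, and the endpoint is the one from the captured request.

```python
from urllib.parse import urlencode

# Endpoint taken from the captured request above.
BASE_URL = 'https://api.inews.qq.com/newsqa/v1/query/inner/publish/modules/list'


def build_api_url(modules):
    """Return the endpoint URL restricted to the given module names."""
    # safe=',' keeps the comma-separated list as-is instead of encoding it as %2C
    query = urlencode({'modules': ','.join(modules)}, safe=',')
    return f'{BASE_URL}?{query}'


api = build_api_url(['chinaDayList', 'chinaDayAddList'])
print(api)
```

The resulting string is the same URL that the first cell hard-codes.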

In [1]:
from urllib.request import urlopen
import json

api = 'https://api.inews.qq.com/newsqa/v1/query/inner/publish/modules/list?modules=chinaDayList,chinaDayAddList'
raw_data = urlopen(api).read().decode('utf-8')
raw_data = json.loads(raw_data)['data']

raw_data

{'chinaDayAddList': [{'suspect': 3,
   'heal': 1521,
   'localinfectionadd': 2177,
   'confirm': 3551,
   'dead': 249,
   'importedCase': 81,
   'infect': 2316,
   'localConfirmadd': 1656,
   'deadRate': '7.0',
   'healRate': '42.8',
   'date': '03.19',
   'y': '2022'},
  {'suspect': 6,
   'dead': 246,
   'localinfectionadd': 2384,
   'localConfirmadd': 1947,
   'deadRate': '7.2',
   'confirm': 3423,
   'heal': 1467,
   'importedCase': 80,
   'infect': 2492,
   'healRate': '42.9',
   'date': '03.20',
   'y': '2022'},
  {'heal': 980,
   'importedCase': 57,
   'localinfectionadd': 2313,
   'localConfirmadd': 2281,
   'deadRate': '6.3',
   'suspect': 0,
   'dead': 223,
   'healRate': '27.5',
   'date': '03.21',
   'y': '2022',
   'confirm': 3563,
   'infect': 2432},
  {'confirm': 3813,
   'dead': 245,
   'importedCase': 76,
   'infect': 2469,
   'localinfectionadd': 2346,
   'localConfirmadd': 2591,
   'deadRate': '6.4',
   'date': '03.22',
   'y': '2022',
   'suspect': 3,
   'heal': 1938

## Part 2. Data Cleaning

We then compare the response with the figures displayed on the web page to work out which field corresponds to which label.

```json
{
    "chinaDayAddList": [
        {
            "y": "2022",
            "confirm": 5451,                 // new confirmed (新增确诊)
            "suspect": 0,                    // new suspected (新增疑似)
            "date": "04.15"
        }
    ],
    "chinaDayList": [
        {
            "y": "2022",
            "nowConfirm": 259560,            // current confirmed (现有确诊)
            "dead": 14561,                   // cumulative deaths (累计死亡)
            "heal": 227416,                  // cumulative recovered (累计治愈)
            "confirm": 519822,               // cumulative confirmed (累计确诊)
            "date": "04.15"
        }
    ]
}
```
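
The correspondence above can be kept next to the code as a pair of lookup tables. The following is only a sketch: `DAY_ADD_LABELS`, `DAY_LIST_LABELS`, and `describe` are hypothetical names, and the sample record reuses the `04.15` values from the JSON excerpt.

```python
# Field-to-label mappings read off the comparison above (hypothetical names).
DAY_ADD_LABELS = {
    'confirm': 'new confirmed (新增确诊)',
    'suspect': 'new suspected (新增疑似)',
}
DAY_LIST_LABELS = {
    'nowConfirm': 'current confirmed (现有确诊)',
    'dead': 'cumulative deaths (累计死亡)',
    'heal': 'cumulative recovered (累计治愈)',
    'confirm': 'cumulative confirmed (累计确诊)',
}


def describe(record, labels):
    """Pair each known field of one API record with its web-page label."""
    return {labels[key]: value for key, value in record.items() if key in labels}


sample = {'y': '2022', 'confirm': 5451, 'suspect': 0, 'date': '04.15'}
print(describe(sample, DAY_ADD_LABELS))
```

Two separate tables are used because `confirm` means a daily increase in `chinaDayAddList` but a running total in `chinaDayList`.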

In [2]:
import pandas as pd

day_add = pd.DataFrame.from_records(raw_data['chinaDayAddList'])
day_add['date'] = day_add['y'] + '-' + day_add['date'].str.replace('\\.', '-', regex=True)
day_add['date'] = pd.to_datetime(day_add['date'])

day_add.drop(day_add.columns.difference(['date', 'confirm', 'suspect']), axis=1, inplace=True)  # keep only info we need

day_add.sort_values(by='date', inplace=True)
day_add = day_add.tail(30)  # we only need the last 30 days
day_add.fillna(day_add.mean(numeric_only=True), inplace=True)  # fill missing values with mean
day_add['day_bias'] = (day_add['date'] - day_add['date'].min()) / pd.Timedelta('1 days') + 1  # 1-based day index within the 30-day window

day_add.tail(5)  # preview

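|    | suspect | confirm | date       | day_bias |
|---:|--------:|--------:|:-----------|---------:|
| 55 | 0 | 64238 | 2022-05-14 | 26.0 |
| 56 | 0 | 68916 | 2022-05-15 | 27.0 |
| 57 | 0 | 61933 | 2022-05-16 | 28.0 |
| 58 | 0 | 66109 | 2022-05-17 | 29.0 |
| 59 | 0 | 85364 | 2022-05-18 | 30.0 |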
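
To make the missing-value step concrete, here is a small sketch of what `fillna` with the column means does. The `toy` frame and its values are made up for illustration and are not taken from the API.

```python
import pandas as pd

# Made-up values; None marks a missing observation.
toy = pd.DataFrame({
    'confirm': [10.0, None, 30.0],
    'suspect': [1.0, 2.0, None],
})

# Each gap is replaced by the mean of the non-missing values in its own column.
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```

Because the cell above calls `tail(30)` before `fillna`, the means are taken over the 30-day window only, not over the full history.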


In [3]:
day_info = pd.DataFrame.from_records(raw_data['chinaDayList'])
day_info['date'] = day_info['y'] + '-' + day_info['date'].str.replace('\\.', '-', regex=True)
day_info['date'] = pd.to_datetime(day_info['date'])

day_info.drop(day_info.columns.difference(['date', 'dead', 'heal', 'confirm', 'nowConfirm']), axis=1, inplace=True)  # keep only info we need
day_info.sort_values(by='date', inplace=True)
day_info = day_info.tail(30)  # we only need the last 30 days

day_info.fillna(day_info.mean(numeric_only=True), inplace=True)  # fill missing values with mean
day_info['day_bias'] = (day_info['date'] - day_info['date'].min()) / pd.Timedelta('1 days') + 1  # 1-based day index within the 30-day window

day_info.tail(5)  # preview

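|    | date       | dead  | nowConfirm | confirm | heal   | day_bias |
|---:|:-----------|------:|-----------:|--------:|-------:|---------:|
| 55 | 2022-05-14 | 15618 | 952224  | 1253277 | 285435 | 26.0 |
| 56 | 2022-05-15 | 15642 | 1020408 | 1322193 | 286143 | 27.0 |
| 57 | 2022-05-16 | 15672 | 1081796 | 1384126 | 286658 | 28.0 |
| 58 | 2022-05-17 | 15713 | 1147370 | 1450235 | 287152 | 29.0 |
| 59 | 2022-05-18 | 15759 | 1232184 | 1535599 | 287656 | 30.0 |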
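
The two cleaning cells apply the same steps to different lists, so they could share one helper. The sketch below assumes the record layout shown earlier (`y` and `date` as strings); `last_n_days` is a hypothetical name, and unlike the cells above it parses the date with an explicit format and returns `day_bias` as an integer.

```python
import pandas as pd


def last_n_days(records, keep, n=30):
    """Clean one API list: parse dates, keep selected columns and the last n days."""
    df = pd.DataFrame.from_records(records)
    # 'y' is '2022' and 'date' is 'MM.DD', so the joined string looks like '2022-04.15'
    df['date'] = pd.to_datetime(df['y'] + '-' + df['date'], format='%Y-%m.%d')
    df = df[['date', *keep]].sort_values('date').tail(n).copy()
    df = df.fillna(df.mean(numeric_only=True))  # fill gaps with the window mean
    df['day_bias'] = (df['date'] - df['date'].min()).dt.days + 1  # 1-based day index
    return df


# Made-up records, deliberately out of order and with one missing value.
toy_records = [
    {'y': '2022', 'date': '04.15', 'confirm': 5, 'dead': 1, 'extra': 9},
    {'y': '2022', 'date': '04.13', 'confirm': 3, 'dead': None, 'extra': 9},
    {'y': '2022', 'date': '04.14', 'confirm': 4, 'dead': 3, 'extra': 9},
]
toy_out = last_n_days(toy_records, keep=['confirm', 'dead'], n=3)
print(toy_out)
```

With such a helper, the two cells would reduce to one call each, for example `last_n_days(raw_data['chinaDayList'], ['dead', 'heal', 'confirm', 'nowConfirm'])`.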


## Part 3. Linear Regression Models

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np


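def modeling(name, x, y):
    """Fit y = a * x + b, report the test RMSE, and predict the day after the last one in x."""
    print(name)
    # training
    model = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(
        x.values.reshape(-1, 1), y, test_size=0.2, shuffle=False)  # no shuffling: train on the earliest 80%, test on the latest 20%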
    model.fit(X_train, y_train)
    print(f'Y = {model.coef_[0]} * X + {model.intercept_}')

    # validating
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'RMSE = {rmse}')

    # predicting
    pred_day_bias = [[x.max() + 1]]  # 2D array for prediction
    print(f'Pred = {model.predict(pred_day_bias)[0]:.3f}')
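
For readers who want to see what the function above computes without scikit-learn, the same steps can be sketched with NumPy alone. This is an illustrative equivalent, not the code used for the results: `fit_line_chronological` is a hypothetical name, `np.polyfit` with degree 1 stands in for `LinearRegression`, and the data below is synthetic.

```python
import numpy as np


def fit_line_chronological(x, y, test_size=0.2):
    """Fit y = a*x + b on the earliest samples and score it on the latest ones."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_test = int(np.ceil(len(x) * test_size))  # 30 samples -> 6 held out
    x_train, x_test = x[:-n_test], x[-n_test:]
    y_train, y_test = y[:-n_test], y[-n_test:]

    slope, intercept = np.polyfit(x_train, y_train, deg=1)  # ordinary least squares
    residuals = y_test - (slope * x_test + intercept)
    rmse = np.sqrt(np.mean(residuals ** 2))
    next_day = slope * (x.max() + 1) + intercept  # prediction for day 31
    return slope, intercept, rmse, next_day


# Synthetic, perfectly linear series: y = 2x + 1 over 30 days.
days = np.arange(1, 31)
slope, intercept, rmse, next_day = fit_line_chronological(days, 2 * days + 1)
print(slope, intercept, rmse, next_day)
```

On perfectly linear data the RMSE is essentially zero; on the real series it measures how far the last six days drift from the trend of the first 24.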

In [5]:
modeling('Now Confirm (现有确诊)', day_info['day_bias'], day_info['nowConfirm'])

Now Confirm (现有确诊)
Y = 21122.561304347822 * X + 185798.35869565222
RMSE = 297900.18913779483
Pred = 840597.759
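
One way to read this RMSE is to compare the fitted line with the most recent observations. The sketch below does that for the five rows visible in the `day_info` preview in Part 2, assuming the preview and the fit come from the same run. The held-out split has six days, but the sixth is not shown in the preview and is left out here.

```python
# Fitted line printed above for nowConfirm.
slope, intercept = 21122.561304347822, 185798.35869565222

# (day_bias, nowConfirm) pairs copied from the day_info preview in Part 2.
observed = [
    (26, 952224),
    (27, 1020408),
    (28, 1081796),
    (29, 1147370),
    (30, 1232184),
]

# Residual = observed value minus the value on the fitted line.
residuals = [actual - (slope * day + intercept) for day, actual in observed]
for (day, actual), residual in zip(observed, residuals):
    print(f'day {day}: observed {actual}, residual {residual:.0f}')
```

Every residual is positive and each is larger than the one before, which suggests the straight line fitted on the earlier days underestimates the recent growth.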


In [6]:
modeling('New Confirm (新增确诊)', day_add['day_bias'], day_add['confirm'])

New Confirm (新增确诊)
Y = 2355.0369565217393 * X + -4878.253623188408
RMSE = 10695.218975876382
Pred = 68127.892


In [7]:
modeling('New Suspect (新增疑似)', day_add['day_bias'], day_add['suspect'])

New Suspect (新增疑似)
Y = 0.005217391304347826 * X + 0.10144927536231883
RMSE = 0.24508956136418178
Pred = 0.263


In [8]:
modeling('Accumulated Confirm (累计确诊)', day_info['day_bias'], day_info['confirm'])

Accumulated Confirm (累计确诊)
Y = 23367.355652173912 * X + 435908.72101449274
RMSE = 287649.98095579416
Pred = 1160296.746


In [9]:
modeling('Accumulated Heal (累计治愈)', day_info['day_bias'], day_info['heal'])

Accumulated Heal (累计治愈)
Y = 2157.5939130434776 * X + 236454.57608695654
RMSE = 9852.368743932368
Pred = 303339.987


In [10]:
modeling('Accumulated Dead (累计死亡)', day_info['day_bias'], day_info['dead'])

Accumulated Dead (累计死亡)
Y = 39.44260869565217 * X + 14675.967391304348
RMSE = 98.06515957287887
Pred = 15898.688
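
As a consistency check, each `Pred` value can be recomputed by hand from the printed slope and intercept. With a 30-day window, `x.max() + 1` is day 31, so the prediction is `slope * 31 + intercept`. The dictionary keys below are descriptive labels chosen for this sketch; the numbers are copied from the outputs above.

```python
# (slope, intercept, printed prediction) copied from the outputs above.
fits = {
    'nowConfirm': (21122.561304347822, 185798.35869565222, 840597.759),
    'confirm (daily)': (2355.0369565217393, -4878.253623188408, 68127.892),
    'suspect (daily)': (0.005217391304347826, 0.10144927536231883, 0.263),
    'confirm (cumulative)': (23367.355652173912, 435908.72101449274, 1160296.746),
    'heal': (2157.5939130434776, 236454.57608695654, 303339.987),
    'dead': (39.44260869565217, 14675.967391304348, 15898.688),
}
NEXT_DAY = 31  # x.max() + 1 for a 30-day window

recomputed = {}
for name, (slope, intercept, printed) in fits.items():
    recomputed[name] = slope * NEXT_DAY + intercept
    print(f'{name}: {recomputed[name]:.3f} (printed {printed})')
```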
