## **MAST30034 Applied Data Science week 3**

### Git

### Project 1 topic selection:
- Not too obscure
- Not too simple
- Imaginative
- Practical

e.g.:

#### Prediction
1. forecasting trip duration
2. forecasting tip amount
3. forecasting drop off location

#### Analysis
1. relation between weather and trip distance/fare/amount/tips
2. relation between public events and trip amount
3. relation between location and fare/tips/duration
4. generalised analysis on fare/tips/duration

#### Comparison
1. impact of Covid-19 on taxis
2. taxi demand in different public locations
3. taxi demand throughout the day/a week/a year
4. time series analysis



#### **Topic: Generalized analysis on Tip amount**

- Aim:
    * Explore the key factors that influence tip amount
    * Predict tip amount

- Perspective: the taxi company and its drivers

- Choice of timeline and taxi type: 2019 and 2020, yellow taxis

- External dataset: weather

- Assumptions:
    * People have adapted to living with Covid-19
    * Only trips paid by credit card are considered (cash tips are not recorded in the TLC data)
    * Weather is constant throughout each day

##### 1. Data preprocessing

The preliminary processing can be split into two stages:
1. filter invalid data
2. exclude outliers for analysis


In [None]:
from pyspark.sql import SparkSession
from urllib.request import urlretrieve
# Create a spark session (which will run spark jobs)
spark = (
    SparkSession.builder.appName("ADS project 1")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.executor.memory","2G")
    .config("spark.driver.memory","4G")
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

In [None]:
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

In [None]:
from pyspark.sql import functions as F
# Caution: the wildcard import below shadows Python built-ins such as sum, min, max and round
from pyspark.sql.functions import *
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sbs
import geopandas as gpd
import folium

In [None]:
# Directory that holds the raw data files
path_raw = "data/"
#path_curated = "../data/curated/tlc_data/"

In [None]:
# read in the January 2019 trip records
sdf = spark.read.parquet(f'{path_raw}2019-01.parquet')
sdf

Create new features:
- time duration
- weekend or weekday
- average speed
- whether airport
- tip rate
- congestion zone
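
The `add_feature` helper called further down is not shown in these notes. The block below is a minimal pandas sketch of how the listed features could be derived on a small sample. It is not the notebook's own implementation: the function name `add_features_pandas`, the new column names, the airport zone IDs, and the use of `congestion_surcharge` as a proxy for the congestion zone are all assumptions.

```python
import pandas as pd

# Assumed TLC taxi-zone IDs for Newark, JFK and LaGuardia airports
AIRPORT_ZONES = {1, 132, 138}

def add_features_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pandas counterpart of add_feature (not the notebook's code)."""
    out = df.copy()
    pickup = pd.to_datetime(out["tpep_pickup_datetime"])
    dropoff = pd.to_datetime(out["tpep_dropoff_datetime"])
    # trip duration in seconds
    out["time_duration"] = (dropoff - pickup).dt.total_seconds()
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
    out["is_weekend"] = pickup.dt.dayofweek >= 5
    # average speed in miles per hour (zero-duration trips give inf/NaN, cleaned later)
    out["average_speed"] = out["trip_distance"] / (out["time_duration"] / 3600)
    # trip starts or ends at an airport zone
    out["is_airport"] = (
        out["PULocationID"].isin(AIRPORT_ZONES) | out["DOLocationID"].isin(AIRPORT_ZONES)
    )
    # tip as a fraction of the metered fare
    out["tip_rate"] = out["tip_amount"] / out["fare_amount"]
    # assumption: a positive congestion surcharge marks a congestion-zone trip
    out["in_congestion_zone"] = out["congestion_surcharge"].fillna(0) > 0
    return out

# Two made-up trips: a Saturday airport pickup and a Monday short ride
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2019-01-05 10:00:00", "2019-01-07 08:00:00"],
    "tpep_dropoff_datetime": ["2019-01-05 10:30:00", "2019-01-07 08:10:00"],
    "trip_distance": [6.0, 1.0],
    "fare_amount": [20.0, 8.0],
    "tip_amount": [4.0, 0.0],
    "PULocationID": [132, 10],
    "DOLocationID": [50, 20],
    "congestion_surcharge": [2.5, 0.0],
})
featured = add_features_pandas(trips)
print(featured[["time_duration", "is_weekend", "average_speed", "is_airport", "tip_rate"]])
```

The same logic translates to Spark column expressions if the features need to be computed on the full dataset rather than on a sample.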

Clean invalid data:
- time duration < 60s
- tip < 0
- distance < 0
- fare amount < 2.5
- payment != 1
- missing values
- outliers
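
The `clean_feature` helper is likewise not shown. As a rough sketch, the validity rules above can be written as a single boolean mask in pandas. The function name `clean_trips_pandas` and the column names are assumptions, and outlier removal is left to the IQR step in the plotting section.

```python
import pandas as pd

def clean_trips_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pandas counterpart of clean_feature (not the notebook's code)."""
    required = ["time_duration", "tip_amount", "trip_distance", "fare_amount", "payment_type"]
    # drop rows with missing values in any column used by the rules
    out = df.dropna(subset=required)
    valid = (
        (out["time_duration"] >= 60)      # at least one minute long
        & (out["tip_amount"] >= 0)        # no negative tips
        & (out["trip_distance"] >= 0)     # no negative distances
        & (out["fare_amount"] >= 2.5)     # at least the minimum fare used in the rules above
        & (out["payment_type"] == 1)      # 1 = credit card in the TLC data dictionary
    )
    return out[valid].reset_index(drop=True)

# Made-up rows: only the first one satisfies every rule
demo = pd.DataFrame({
    "time_duration": [600, 30, 900, 700, 800],
    "tip_amount": [2.0, 1.0, -1.0, 0.0, 1.5],
    "trip_distance": [1.2, 0.1, 3.0, 2.0, 2.5],
    "fare_amount": [8.0, 3.0, 12.0, 9.0, None],
    "payment_type": [1, 1, 1, 2, 1],
})
cleaned = clean_trips_pandas(demo)
print(len(demo), "->", len(cleaned), "rows")
```

Comparing the row count before and after each rule is a cheap way to see which filter removes the most data.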

In [None]:
sdf = add_feature(sdf)
sdf

In [None]:
sdf = clean_feature(sdf)
sdf

In [None]:
sdf.count()

##### 2. Plotting

- boxplot
- scatterplot
- histogram
- pair plot
- heatmap

In [None]:
SAMPLE_SIZE = 0.1  # fraction of rows to sample, not a row count
#SAMPLE_SIZE = 0.01
cols = ['trip_distance', 'fare_amount', 'average_speed', 'time_duration', 'tip_amount']

In [None]:
sampled_df = sdf.sample(SAMPLE_SIZE, seed=1).toPandas()
sampled_df

In [None]:
box_plot(sampled_df)
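
`box_plot` is a helper that is not defined in these notes. One possible shape for it, using only matplotlib, is sketched below. The name `box_plot_sketch`, its signature, and the synthetic data are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def box_plot_sketch(df, columns):
    """Hypothetical stand-in for box_plot: one box plot per numeric column."""
    fig, axes = plt.subplots(1, len(columns), figsize=(4 * len(columns), 4), squeeze=False)
    for ax, col in zip(axes[0], columns):
        # separate axes so that features on different scales stay readable
        ax.boxplot(df[col].dropna().to_numpy())
        ax.set_title(col)
    fig.tight_layout()
    return fig

# Synthetic right-skewed data standing in for the sampled trips
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "trip_distance": rng.exponential(3.0, size=200),
    "tip_amount": rng.exponential(2.0, size=200),
})
fig = box_plot_sketch(demo, ["trip_distance", "tip_amount"])
```

Giving each feature its own axis avoids one large-valued column flattening the boxes of the others.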

In [None]:
def create_quan_df():
    colu = cols
    inde = ['Q0', 'Q1', 'Q3', 'Q4', 'IQR']
    return pd.DataFrame(index = inde, columns = colu)

In [None]:
# calculate the quantiles for each feature, fill them into the quantile dataframe
def find_quantile(df, quan_df):
    # Find the quantiles for each feature in the cols
    Q0 = df[cols].quantile(0.05)
    Q1 = df[cols].quantile(0.25)
    Q3 = df[cols].quantile(0.75)
    Q4 = df[cols].quantile(0.95)
    IQR = Q3 - Q1
    # fill the quantiles into a dataframe
    for fea in cols:    
        quan_df.loc["Q0", fea] = Q0[fea]
        quan_df.loc["Q1", fea] = Q1[fea]
        quan_df.loc["Q3", fea] = Q3[fea]
        quan_df.loc["Q4", fea] = Q4[fea]
        quan_df.loc["IQR", fea] = IQR[fea]
    return quan_df, Q1, Q3, IQR

In [None]:
quan_df = create_quan_df()
quan_df, Q1, Q3, IQR = find_quantile(sampled_df, quan_df)
quan_df

In [None]:
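# Keep only rows where every feature lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
is_outlier = ((sampled_df[cols] < (Q1 - 1.5*IQR)) | (sampled_df[cols] > (Q3 + 1.5*IQR))).any(axis=1)
removed_df = sampled_df[~is_outlier]
box_plot(removed_df)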

Plots can also be used for feature analysis.

Approach: propose your own hypotheses, then verify them step by step.

- Does tip amount follow a recognisable distribution?
- Which numeric features are most closely related to the final tip amount?
- Which of the chosen categorical features are most closely related to the final tip amount?
- Do tipping habits differ substantially between 2019 and 2021?
- How do tipping habits vary with the pick-up and drop-off location?

In [None]:
pair_plot(sampled_df)

In [None]:
sbs.set_theme(style = 'darkgrid')
sbs.histplot(sampled_df['tip_amount'], kde = True)
plt.title('Tip Amount in 2019', size = 15)
plt.xlabel('Tip Amount', size = 13)
plt.savefig("plots/Tip Amount")
plt.show()

##### 3. Merge External Dataset

In [None]:
external_cols = ['Month,Date', 'Temperature (F)', 'Wind Speed (mph)', 'is_rainy']
external = spark.read.csv(f'data/NYC weather 2019 cleaned.csv', header = True)
# take the needed columns only
external = external.select(external_cols)
# merge the two datasets
merged = sdf.join(external, on = 'Month,Date', how = 'leftouter')
merged
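
The join above only works if both tables carry a `Month,Date` column with identical formatting, and the notes do not show how that key is built on the trip side. The pandas sketch below illustrates one way to derive the key and to count the trips that found no weather row after a left join. The `"month,day"` key format and the weather values are made up for illustration.

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2019-01-01 09:00", "2019-01-02 18:30", "2019-01-03 07:15"]
    ),
    "tip_amount": [2.0, 0.0, 3.5],
})
# Hypothetical weather rows; the key format "month,day" is an assumption
weather = pd.DataFrame({
    "Month,Date": ["1,1", "1,2"],
    "Temperature (F)": [48.0, 38.0],
    "is_rainy": [1, 0],
})

# Build the join key on the trip side in the same format as the weather table
pickup = trips["tpep_pickup_datetime"]
trips["Month,Date"] = pickup.dt.month.astype(str) + "," + pickup.dt.day.astype(str)

# A left join keeps every trip; indicator=True records whether a weather row matched
merged_demo = trips.merge(weather, on="Month,Date", how="left", indicator=True)
n_unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(f"{n_unmatched} trip(s) without weather data")
```

A non-zero unmatched count usually points to a key formatting mismatch or to missing days in the external dataset.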

##### 4. Geospatial Plot

In [None]:
# plot the average Tip amount in different PickUp and DropOff Locations in 2021
gdf, geoJSON = create_geo()
df_pu = create_proportion(sdf, 'PU', gdf)
m_pu = plot_map(df_pu, geoJSON, 'PU')
df_do = create_proportion(sdf, 'DO', gdf)
m_do = plot_map(df_do, geoJSON, 'DO')

##### 5. Analysis of categorical and continuous features

- For continuous features, use a correlation heatmap

- For categorical features, use aggregation and ANOVA
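
As a small illustration of both approaches, the sketch below computes a Pearson correlation matrix (the input to a heatmap) and runs a one-way ANOVA with `scipy.stats.f_oneway` on synthetic data. The `time_of_day` column and all values are invented for the example and do not come from the taxi data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic trips: tips depend on the fare and on a made-up categorical feature
time_of_day = np.repeat(["morning", "afternoon", "night"], 100)
group_effect = {"morning": 1.0, "afternoon": 2.0, "night": 3.0}
fare = rng.uniform(5, 50, size=300)
tip = (
    0.1 * fare
    + np.array([group_effect[g] for g in time_of_day])
    + rng.normal(0, 0.3, size=300)
)
demo = pd.DataFrame({"time_of_day": time_of_day, "fare_amount": fare, "tip_amount": tip})

# Continuous features: Pearson correlation matrix, which a heatmap would display
corr = demo[["fare_amount", "tip_amount"]].corr()

# Categorical feature: one-way ANOVA on tip amount across the groups
samples = [grp["tip_amount"].to_numpy() for _, grp in demo.groupby("time_of_day")]
f_stat, p_value = stats.f_oneway(*samples)

print(corr.round(2))
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

A small p-value suggests that mean tip amount differs between at least two groups. ANOVA itself assumes roughly normal residuals and similar variance in each group, so skewed tip data may call for a transformation first.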

In [None]:
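# `cols` above lists the continuous features; grouping should use the categorical ones.
# These column names are assumed, replace them with the ones add_feature actually creates.
categorical_cols = ['is_weekend', 'is_airport']
aggregated_results = (
    sdf.groupBy(categorical_cols)
    .agg(
        # mean tip for each combination of the categorical features
        F.mean("tip_amount").alias("avg_tip_amount"),
        F.mean("tip_rate").alias("avg_tip_rate"),
    )
    # ascending order of average tip amount
    .orderBy('avg_tip_amount')
)

aggregated_results.show()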

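##### 6. Statistical models
The four assumptions of a linear model:
- Linearity: there is a linear relationship between the independent variable x and the dependent variable y
- Independence: the residuals are independent of each other and unaffected by other factors (e.g. residuals should not grow over time)
- Homoscedasticity: the variance of the residuals is constant for every value of x
- Normality: the residuals are normally distributed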
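
A quick numerical check of some of these assumptions can be sketched with NumPy alone. The example below fits ordinary least squares to synthetic data and inspects the residuals. The data and the informal checks are illustrative assumptions, not a formal test procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data that satisfies the assumptions: y = 1.5 + 0.8 x + Gaussian noise
x = rng.uniform(0, 10, size=500)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=500)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Homoscedasticity: residual spread should be similar for small and large x
low = residuals[x < np.median(x)]
high = residuals[x >= np.median(x)]
spread_ratio = high.std() / low.std()

# Independence: lag-1 autocorrelation of the residuals should be near zero
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print("intercept, slope:", beta.round(3))
print("spread ratio (high x / low x):", round(float(spread_ratio), 3))
print("lag-1 residual correlation:", round(float(lag1_corr), 3))
```

A spread ratio far from 1 hints at heteroscedasticity, and a large lag-1 correlation hints at dependent residuals. Normality is easier to judge visually with a histogram of the residuals or a QQ plot.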

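Other models worth considering:

Classification:
- LR (logistic regression)
- SVM (support vector machine)
- BNB (Bernoulli naive Bayes)
- RF (random forest)

Regression:
- GLM (generalised linear model)
- RFR (random forest regressor)
- XGB (XGBoost)
- MLP (multilayer perceptron)

Evaluation metrics: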
- MSE
- MAE
- RMSE
- Pearson Correlation
- Acc, Recall, Precision, F1
- ...
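
The regression metrics in this list are short enough to compute directly with NumPy, as sketched below. The helper name `regression_metrics` and the sample values are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and Pearson correlation for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mse = float(np.mean(errors ** 2))
    return {
        "MSE": mse,
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(mse)),
        "Pearson": float(np.corrcoef(y_true, y_pred)[0, 1]),
    }

# Made-up actual and predicted tip amounts
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE is in the same units as the target (dollars for tip amount), which usually makes it easier to interpret than MSE.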

Evaluate using plots:
- Distribution of residuals
- QQ plot

In [None]:
from scipy import stats

fig, ax = plt.subplots()
# `plot` is expected to hold the model residuals in a 'residual' column
stats.probplot(plot['residual'], plot=ax, fit=True)
# save the figure
plt.savefig('../plots/QQ plot')
plt.show()