# Table of Contents
### - Notebook set-up
### - Data cleaning
### - Looking for trends through scatterplots
### - Hypothesis
### - Creating a training and test set
### - Running regression over test set
### - Relationship analysis

# Setting Up the Notebook

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Ensure matplotlib visualizations appear in the notebook
%matplotlib inline

In [None]:
# Set up path and import dataset
path = r'C:\Users\mmreg\OneDrive\Desktop\Data Analytics Course Work\Data Immersion\Tasks\08-2022 Exploratory Analytics Project\02 Data'
df = pd.read_csv(os.path.join(path, 'Prepared', 'citibike_clean_time.csv'), index_col = False)

In [None]:
# Ensure proper import
df.head()

# Question 3
## Clean your data so that it's ready for analysis.

In [None]:
# Check for missing values
df.isnull().sum()
# No missing values

In [None]:
# Check for duplicate records
dupes = df[df.duplicated()]

In [None]:
dupes
# No duplicates found
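
The duplicate check above can also be reduced to a single count. This is a minimal sketch on hypothetical toy data (the `bike_id` column and all values are invented for illustration, not taken from the Citi Bike file):

```python
import pandas as pd

# Hypothetical toy data; rows 1 and 2 are identical on purpose
toy = pd.DataFrame({
    'bike_id': [1, 2, 2, 3],
    'trip_duration': [300, 450, 450, 600],
})

# duplicated() flags every repeat of an earlier row; sum() counts the flags
n_dupes = toy.duplicated().sum()
print('Duplicate rows:', n_dupes)

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
```

A count of zero would support the same conclusion as an empty `dupes` dataframe.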

In [None]:
# I noticed outliers in the birth year column a couple of tasks ago. I will check on this now.
df['birth_year'].mean()

In [None]:
df['birth_year'].median()

In [None]:
df['birth_year'].mode()

In [None]:
df['birth_year'].min()

In [None]:
df['birth_year'].max()

In [None]:
# In the previous task, I asserted that the service's demographic is riders aged 70 and younger. Test this by isolating records with a birth year before 1943.
df_test = df[df['birth_year'] < 1943]

In [None]:
df_test['birth_year'].value_counts(dropna = False)

In [None]:
# These records with implausible birth years account for less than 0.003% of all records; I will remove them from the dataset to reduce skew
df = df[df['birth_year'] >= 1943]
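
One way to back up a percentage like the one quoted above is to compute the share of rows below the cutoff before filtering. The sketch below uses a small hypothetical dataframe, so the printed share will not match the real dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the real records
toy = pd.DataFrame({'birth_year': [1899, 1941, 1943, 1960, 1975, 1988, 1990, 1995]})

cutoff = 1943  # same threshold as the filter above

# Boolean mask of rows below the cutoff; sum() counts the True values
n_removed = (toy['birth_year'] < cutoff).sum()
share_removed = n_removed / len(toy) * 100

print(f'{n_removed} of {len(toy)} rows fall below {cutoff} ({share_removed:.3f}%)')

# Keep only rows at or above the cutoff
toy_kept = toy[toy['birth_year'] >= cutoff]
```

Running the same two lines on `df` before the filter would give the actual share removed.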

In [None]:
# Confirm with a histogram that the birth_year outliers have been removed
sns.histplot(df['birth_year'], bins = 25, kde = True)

In [None]:
# Check gender column for outliers
sns.histplot(df['gender'], bins = 15, kde = True)

In [None]:
# It is known that a value of 0 (undecided) exists in the data; let's check how many there are
df['gender'].value_counts(dropna = False)

In [None]:
# Since there are only 2 unknown points, I will remove them from the dataset
df = df[df['gender'] >= 1]

In [None]:
# Ensure records with unknown gender have been dropped
df['gender'].value_counts(dropna = False)

### All data has been cleaned and is ready for analysis

# Question 4
## Explore your data visually (e.g., by way of a scatterplot), looking for variables whose relationship you’d like to test.

### The only correlation with any real strength in 6.2 was between trip_duration and age. This will be examined here, among others.

In [None]:
# Create scatterplot with trip_duration and birth_year (a proxy for age)
sns.set(rc = {'figure.figsize':(20,12)})
df.plot(x = 'birth_year', y='trip_duration',style='o')
plt.title('Duration of Trip by Age')  
plt.xlabel('Birth Year')  
plt.ylabel('Trip Duration (Sec)')  
plt.show()

In [None]:
# Create scatterplot with trip_duration and gender
sns.set(rc = {'figure.figsize':(20,12)})
df.plot(x = 'trip_duration', y='gender',style='o')
plt.title('Duration of Trip by Gender')  
plt.xlabel('Trip Duration (Sec)')  
plt.ylabel('Gender (1: Male, 2: Female)')  
plt.show()

In [None]:
# Create scatterplot of start_hour and birth_year
sns.set(rc = {'figure.figsize':(20,12)})
df.plot(x = 'birth_year', y='start_hour',style='o')
plt.title('Start of Trip by Age')  
plt.xlabel('Birth Year')  
plt.ylabel('Time of Trip Start')  
plt.show()

In [None]:
# Create scatterplot of start_hour and gender
sns.set(rc = {'figure.figsize':(20,12)})
df.plot(x = 'gender', y='start_hour',style='o')
plt.title('Start of Trip by Gender')  
plt.xlabel('Gender')  
plt.ylabel('Time of Trip Start')  
plt.show()

# Question 5
## State your hypothesis in a markdown cell within your Jupyter notebook.

### There aren't many strongly correlated variables in the dataset; however, the most promising relationship is the one between trip duration and birth year. Based on the scatterplot, I propose the following hypothesis:
### - The younger the customer (that is, the higher the birth year), the longer the duration of the trip.
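
A hypothesis of this shape predicts a positive correlation between birth year and trip duration. As a rough illustration, the Pearson coefficient can be computed with pandas. The values below are hypothetical and say nothing about the real data:

```python
import pandas as pd

# Hypothetical toy data; not drawn from the Citi Bike file
toy = pd.DataFrame({
    'birth_year': [1950, 1960, 1970, 1980, 1990],
    'trip_duration': [620, 580, 700, 690, 760],
})

# Series.corr() defaults to the Pearson correlation coefficient
r = toy['birth_year'].corr(toy['trip_duration'])
print(f'Pearson r: {r:.3f}')
```

A coefficient near zero on the real columns would already suggest that a linear model will struggle.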

# Question 6
## Reshape the variables into NumPy arrays, with X being the independent variable and y being the dependent variable.

### As a reference, the independent variable (X) will be the column 'birth_year', and the dependent variable (y) will be the column 'trip_duration'.

In [None]:
# Reshape the chosen independent and dependent variables into NumPy arrays
X = df['birth_year'].values.reshape(-1,1)
y = df['trip_duration'].values.reshape(-1,1)

In [None]:
# Confirm proper creation of objects
X

In [None]:
y
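
For reference, `reshape(-1,1)` turns a flat array into a single-column 2D array, which is the input shape scikit-learn estimators expect for `X`. A small sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'birth_year': [1960, 1975, 1990],
    'trip_duration': [540, 720, 610],
})

flat = toy['birth_year'].values                      # 1D array of length 3
X_toy = flat.reshape(-1, 1)                          # 2D array: 3 rows, 1 column
y_toy = toy['trip_duration'].values.reshape(-1, 1)   # same shape as X_toy

# -1 tells NumPy to infer the number of rows from the array length
print(flat.shape, X_toy.shape, y_toy.shape)
```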

# Question 7
## Split the data into two sets: a training set and a test set.

In [None]:
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
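
To illustrate what the split produces, the sketch below applies the same `test_size` and `random_state` to ten hypothetical rows and checks the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical observations, already shaped as single-column arrays
X_toy = np.arange(1980, 1990).reshape(-1, 1)
y_toy = np.arange(500, 1500, 100).reshape(-1, 1)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
same_split = np.array_equal(X_te, X_te2)
```

Fixing `random_state` is what makes the later metrics repeatable between runs.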

# Question 8
## Run a linear regression on the data.
### a. First, fit the model you created to the training set.

In [None]:
# Create regression object
regression = LinearRegression()

In [None]:
# Fit the regression object to the training data
regression.fit(X_train, y_train)

### b. Then, create a prediction for y on the test set.

In [None]:
# Predict y values using X_test
y_predicted = regression.predict(X_test)
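
As a sanity check on the fit and predict steps, the same calls can be run on hypothetical data that is exactly linear, where the recovered slope is known in advance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, perfectly linear toy data: duration = 5 * birth_year - 9000
X_toy = np.array([[1950], [1960], [1970], [1980], [1990]])
y_toy = 5 * X_toy - 9000

model = LinearRegression()
model.fit(X_toy, y_toy)

# With a 2D y, coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = model.coef_[0][0]
intercept = model.intercept_[0]

# predict() returns a 2D array with one row per input row
pred = model.predict(np.array([[2000]]))
print(slope, intercept, pred)
```

On real data the points will not sit on the line, which is what the test-set plot and the metrics later in the notebook quantify.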

# Question 9
## Create a plot that shows the regression line on the test set.

In [None]:
# Plot predicted y-values
plt.scatter(X_test, y_test, color='gray', s=15)
plt.plot(X_test, y_predicted, color='red', linewidth=2)
plt.title('Trip Duration by Age (Test set)')
plt.xlabel('Birth Year')
plt.ylabel('Trip Duration (Sec)')
plt.show()

# Question 10
## Write your own interpretation of how well the line appears to fit the data in a markdown cell.
### The line clearly does not fit the data well. The vast majority of points lie far from the regression line, so birth year alone appears to be a poor estimator of trip duration.

# Question 11
## Check the model performance statistics—MSE and R2.

In [None]:
# Create the objects for the model summary stats
# Note: mean_squared_error returns the MSE, not the RMSE
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)

In [None]:
# View the model summary statistics
print('Slope:', regression.coef_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)
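
Note that `mean_squared_error` returns the mean squared error (in squared seconds here), and the root mean squared error is its square root. The sketch below works through both metrics on hypothetical values and recomputes R2 by hand from its definition:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted trip durations (seconds)
y_true = np.array([600.0, 900.0, 300.0, 1200.0])
y_hat = np.array([700.0, 800.0, 500.0, 1000.0])

mse = mean_squared_error(y_true, y_hat)   # average of squared errors
rmse = np.sqrt(mse)                       # back in the original units (seconds)
r2 = r2_score(y_true, y_hat)

# R2 by definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse, rmse, r2, r2_manual)
```

Because RMSE is in seconds, it is often easier to compare against typical trip durations than the MSE.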

# Question 12
## Compare the predicted y values with the actual y values in a dataframe.

In [None]:
# View predicted and actual y values in dataframe
data = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_predicted.flatten()})
data.head(30)

# Question 13
## Include your thoughts on how well the model performed on the test set in a markdown cell.

### Overall, the model performed poorly. The predicted y-values fall on a straight line, while the plotted data shows far more variability than a line can capture. Removing the few, widely scattered early birth years eliminated most of the skew from outliers; however, a linear regression model simply does not suit this relationship.

In [None]:
# Save dataset for future analysis
df.to_csv(os.path.join(path, 'Prepared', 'citibike_clean.csv'))