## Predicting NYC January High Temperatures
### A Machine Learning Approach with Linear Regression
### Author: Nick Elias
* GitHub Project Repository: https://github.com/NickElias01/datafun-07-ml 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Part 1 - Chart a Straight Line

"In this section, we’ll use a technique called simple linear regression to make predictions from time series data. We’ll use the 1895 through 2018 January average high temperatures in New York City to predict future average January high temperatures and to estimate the average January high temperatures for years preceding 1895." -Deitel and Deitel

In [None]:
c = lambda f: 5 / 9 * (f-32)
temps = [(f, c(f)) for f in range(0, 101, 10)]

temps_df = pd.DataFrame(temps, columns= ['Fahrenheit', 'Celsius'])

axes = temps_df.plot(x='Fahrenheit', y='Celsius', style='.-')
y_label = axes.set_ylabel('Celsius')
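
The plotted points fall on a straight line because the conversion has the form y = mx + b. As a minimal sketch, using only the conversion formula from the cell above (the helper name `to_celsius` is introduced here for illustration), the slope and intercept can be recovered from any two points:

```python
# Fahrenheit-to-Celsius conversion, same formula as the lambda above
def to_celsius(f):
    return 5 / 9 * (f - 32)

# Any two points determine the line y = m*x + b
x1, x2 = 0, 100
y1, y2 = to_celsius(x1), to_celsius(x2)

m = (y2 - y1) / (x2 - x1)   # slope: rise over run
b = y1 - m * x1             # intercept: Celsius value at 0 degrees Fahrenheit

print(f'slope m = {m:.4f}, intercept b = {b:.4f}')
```

The slope matches the 5/9 factor in the formula, and the intercept is the Celsius value at 0 degrees Fahrenheit.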

## Part 2 - Predict Avg High Temp in NYC in January

Let's use Linear Regression on Average High Temperatures in NYC in January.

### Section 1 - Data Acquisition

In [3]:
""" Load the January average high temperatures for New York City from 1895 through 2018 from NOAA’s “Climate at a Glance” time series in Data folder, taken from:
https://www.ncdc.noaa.gov/cag/ """

nyc_df = pd.read_csv('data/ave_hi_nyc_jan_1895-2018.csv')



### Section 2 - Data Inspection

In [None]:
nyc_df.head()

In [None]:
nyc_df.tail()

### Section 3 - Data Cleaning

In [None]:
# For readability, let’s rename the 'Value' column as 'Temperature':

nyc_df.columns = ['Date', 'Temperature', 'Anomaly']

nyc_df.head(3)

Seaborn labels the tick marks on the x-axis with Date values. 
Since this example processes only January temperatures, the x-axis labels will be more readable without the trailing 01 (for January), so we’ll remove it from each Date. 

First, let’s check the column’s type:

In [None]:
nyc_df.Date.dtype

In [None]:
# Drop the month digits from the Date column by floor-dividing every value by 100

nyc_df.Date = nyc_df.Date.floordiv(100)

nyc_df.head(3)
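
As a small standalone sketch of the same step, the snippet below applies `floordiv(100)` to a few sample values. The three values are illustrative stand-ins in the year-plus-01 format described above, not a copy of the dataset:

```python
import pandas as pd

# Illustrative YYYYMM values (assumed format: four-digit year followed by 01)
dates = pd.Series([189501, 189601, 201801])

# Floor division by 100 drops the last two digits, which hold the month
years = dates.floordiv(100)

print(years.tolist())
```

Because the column is an integer type, floor division is enough and no string slicing is needed.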

### Section 4 - Descriptive Statistics

In [None]:
pd.set_option('display.precision', 2)

nyc_df.Temperature.describe()

### Section 5 - Build the Model

The SciPy (Scientific Python) library is widely used for engineering, science and math in Python. Its stats module provides the function linregress, which calculates a regression line’s slope and intercept for a given set of data points.
The object returned by linregress exposes these values as its slope and intercept attributes:

In [None]:
linear_regression = stats.linregress(
    x = nyc_df.Date,
    y = nyc_df.Temperature
    )

linear_regression.slope


In [None]:
linear_regression.intercept
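
To make the calculation less of a black box, here is a minimal sketch of the ordinary least squares formulas for a simple regression line, written in plain Python. The toy data lies exactly on y = 2x + 1 and is illustrative only; it is not NOAA data, and this is not SciPy's actual implementation:

```python
# Toy data on the exact line y = 2x + 1 (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx

# The fitted line passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

On data that sits exactly on a line, the formulas recover that line's slope and intercept; on noisy data such as yearly temperatures, they give the line that minimizes the squared vertical distances to the points.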

### Section 6 - Predict

Let’s predict the average Fahrenheit temperature for January of 2024. 
In the following calculation, linear_regression.slope is m, 2024 is x (the date value for which you’d like to predict the temperature), and linear_regression.intercept is b:

In [None]:
linear_regression.slope * 2024 + linear_regression.intercept

We also can approximate what the average temperature might have been in the years before 1895. 
For example, let’s approximate the average temperature for January of 1890:

In [None]:
linear_regression.slope * 1890 + linear_regression.intercept
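
Both calculations above evaluate the same line, so the gap between them depends only on the slope. As a short worked equation, writing m for linear_regression.slope and b for linear_regression.intercept:

```latex
\hat{y}(x) = m x + b
\qquad\Longrightarrow\qquad
\hat{y}(2024) - \hat{y}(1890) = m\,(2024 - 1890) = 134\,m
```

Note that 2024 and 1890 both fall outside the 1895 through 2018 range of the data, so both values are extrapolations that assume the linear trend continues beyond the observed years.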

### Section 7 - Visualizations

Next, let’s use Seaborn’s regplot function to plot each data point with the dates on the x-axis and the temperatures on the y-axis.

In [None]:
sns.set_style('darkgrid')
axes = sns.regplot(x = nyc_df.Date, y = nyc_df.Temperature)

In [None]:
axes.set_ylim(10,70)

## Part 3 - Prediction

In this example, we’ll use the LinearRegression estimator from sklearn.linear_model. 
By default, this estimator uses all the numeric features in a dataset, performing a multiple linear regression when more than one feature is supplied. 
Here, we perform simple linear regression using one feature as the independent variable. So, we’ll need to select one feature (the Date) from the dataset.

### Section 1 - Build the Model

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    nyc_df.Date.values.reshape(-1, 1),
    nyc_df.Temperature.values,
    random_state=11)

In [None]:
X_train.shape

In [None]:
X_test.shape
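
The shapes above follow from two things: `reshape(-1, 1)` turns the one-dimensional Date array into a single-column two-dimensional array, and `train_test_split` holds out 25% of the samples for testing by default. The sketch below reproduces this on stand-in data, assuming the dataset has exactly one row per year from 1895 through 2018 (124 rows); the zero-filled targets are placeholders, not temperatures:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature: one value per year from 1895 through 2018 (124 samples)
years = np.arange(1895, 2019)
targets = np.zeros(len(years))  # placeholder targets, not real temperatures

# reshape(-1, 1) produces a 2-D column: one row per sample, one feature
X = years.reshape(-1, 1)

# Default split: 75% training, 25% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, targets, random_state=11)

print(X.shape, X_train.shape, X_test.shape)
```

Scikit-learn estimators expect the features as a two-dimensional array, which is why the reshape is needed even with a single feature.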

In [None]:
linear_regression = LinearRegression()

linear_regression.fit(X=X_train, y=y_train)

In [None]:
linear_regression.coef_

In [None]:
linear_regression.intercept_

### Section 2 - Test the Model

In [None]:
predicted = linear_regression.predict(X_test)
expected = y_test

for p, e in zip(predicted[::5], expected[::5]):
    print(f'predicted: {p:.2f}, expected: {e:.2f}')
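
Reading pairs of predicted and expected values gives a feel for the fit, but a single number is easier to compare. One possible summary, which is not part of the original walkthrough, is the mean absolute error. The sketch below uses hypothetical values for illustration only:

```python
import numpy as np

# Hypothetical predicted/expected values, for illustration only
predicted = np.array([37.0, 38.5, 36.0])
expected = np.array([35.0, 40.5, 36.0])

# Mean absolute error: the average size of the miss, in degrees Fahrenheit
mae = np.mean(np.abs(predicted - expected))

print(f'mean absolute error: {mae:.2f}')
```

The same two lines could be applied to the `predicted` and `expected` arrays from the cell above.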

In [None]:
predict = (lambda x: linear_regression.coef_ * x +
           linear_regression.intercept_)

predict(2019)
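
One detail worth knowing: `coef_` is an array with one entry per feature, so the lambda above returns a one-element array rather than a plain number. The sketch below shows this on toy data lying exactly on y = 2x + 1 (illustrative only, with the hypothetical name `model`), along with two alternatives:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on the exact line y = 2x + 1 (illustrative only)
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])

model = LinearRegression().fit(X, y)

# coef_ is an array, so this expression evaluates to a one-element array
via_formula = model.coef_ * 10 + model.intercept_

# Indexing coef_[0] yields a scalar instead
as_scalar = model.coef_[0] * 10 + model.intercept_

# The estimator's own predict method expects a 2-D input: [[value]]
via_predict = model.predict(np.array([[10.0]]))

print(via_formula.shape, float(as_scalar), via_predict.shape)
```

Keeping the array form is convenient when the input is itself an array of years, as in the plotting cell later in this notebook.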

### Section 3 - Predict

In [None]:
predict(1890)

Next, predict the average January high temperature for the year 2024:

In [None]:
predict(2024)

### Section 4 - Visualizations

In [None]:
axes = sns.scatterplot(data=nyc_df, x='Date', y='Temperature', hue='Temperature', palette='winter', legend=False)
axes.set_ylim(10, 70)
x = np.array([min(nyc_df.Date.values), max(nyc_df.Date.values)])

y = predict(x)

line = plt.plot(x, y)

## Part 4 - Insights

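This notebook walks through a typical machine learning pipeline: data acquisition, inspection and cleaning, model building and testing, and finally visualization of the results using Python’s data science libraries. The same NYC January temperature data is fitted twice, first with SciPy’s linregress and then with scikit-learn’s LinearRegression estimator.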