# Predicting Australian Cities With Weather

Author: Ryan Harner

email: ryanharner413@gmail.com

Course Project, UC Irvine, Math 10, F22

## Introduction

In this project, I look at weather data for Australian cities and try to predict the daily maximum temperature. I use StandardScaler, Pipeline, LinearRegression, PoissonRegressor, and Lasso to understand how "MaxTemp" is affected by the other columns of this dataset.

## Main Portion of the Project

Below I have all the libraries and modules needed for my project.

In [1]:
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

I used the weatherAUS.csv file from Kaggle.

In [2]:
df = pd.read_csv("weatherAUS.csv")
df[:3]

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No


The columns are for the most part self-explanatory. The temperature columns are in Celsius, "Sunshine" is measured in hours, and the other columns use metric units, such as millimeters for the "Evaporation" column. The "Location" column is filled with Australian cities. If interested, check out [Reference](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) to see the description of the columns.

In [3]:
df.shape

(145460, 23)

There are 145460 rows and 23 columns in df.
To clean the data, I chose to drop every row that has NaN in any column.

In [4]:
df = df.dropna(axis=0)
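
Dropping every row that has a missing value anywhere can remove far more data than expected. The sketch below is only an illustration on a tiny made-up DataFrame (the values and the name `toy` are assumptions, not the weather data); it compares `dropna(axis=0)` with `dropna(subset=...)`, which only checks the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from weatherAUS.csv.
toy = pd.DataFrame({
    "Location": ["Sale", "Sale", "Hobart"],
    "MaxTemp": [26.6, 24.1, 12.5],
    "Evaporation": [7.4, np.nan, 0.8],
    "Cloud3pm": [np.nan, 3.0, np.nan],
})

# Drops a row if ANY column is missing in that row.
all_clean = toy.dropna(axis=0)

# Drops a row only if one of the listed columns is missing.
subset_clean = toy.dropna(subset=["MaxTemp", "Evaporation"])

print(len(all_clean), len(subset_clean))
```

In this toy example every row has a gap in some column, so the first approach keeps nothing, while the second keeps the rows that are complete in the columns actually used.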

With is_numeric_dtype and a list comprehension, I find the columns that have numeric dtypes. Slicing with [:6] keeps only the first 6 of them in Xlist.

In [5]:
from pandas.api.types import is_numeric_dtype
Xlist = [c for c in df.columns if is_numeric_dtype(df[c])][:6]
Xlist

['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed']
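
As an alternative to the list comprehension, pandas can select numeric columns directly with `select_dtypes`. This is a small sketch on a made-up DataFrame (the name `toy` and its values are assumptions for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Tiny made-up frame standing in for the weather data.
toy = pd.DataFrame({
    "Date": ["2014-01-01", "2014-01-02"],
    "Location": ["Perth", "Sydney"],
    "MinTemp": [18.0, 20.5],
    "MaxTemp": [31.2, 27.4],
    "Rainfall": [0.0, 1.2],
})

# select_dtypes keeps only the numeric columns, in their original order.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Slicing works the same way as on any list.
first_two = numeric_cols[:2]

print(numeric_cols)
print(first_two)
```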

In [6]:
Xlist.append("Location") 
Xlist.append("Date") #adds Location and Date into the list
Xlist

['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'Location',
 'Date']

In [7]:
df_mini = df[Xlist] #creating DataFrame with strings in Xlist as columns

Next I use a list comprehension to collect the unique locations whose names are shorter than 7 letters.

In [8]:
listcomp = [c for c in df_mini["Location"].unique() if len(c)<7] 
listcomp

['Cobar', 'Moree', 'Sydney', 'Sale', 'Cairns', 'Perth', 'Hobart', 'Darwin']

Boolean indexing allows us to shorten df_mini to a DataFrame with the entries for the "Location" column being the same as the strings in listcomp.

In [9]:
df_mini = df_mini[df_mini["Location"].isin(listcomp)].copy()

Using dtypes, I can see that the "Date" column has strings as entries. In the following steps, I make a new column "Datetime" that has datetime values as entries. I also drop the "Date" column and sample 5000 random rows from df_mini.

In [10]:
df_mini.dtypes

MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustSpeed    float64
Location          object
Date              object
dtype: object

In [11]:
datetimed = pd.to_datetime(df_mini["Date"]).to_frame() #This is a dataframe.

In [12]:
df_mini["Datetime"] = datetimed["Date"] #method 1 to get series values into DataFrame
df_mini = df_mini.drop("Date",axis=1).sample(5000, random_state=82623597).copy()
df_mini[:3]

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Location,Datetime
120955,9.0,24.5,0.0,4.4,9.5,20.0,Perth,2009-05-14
33415,22.2,27.4,0.0,7.2,6.2,56.0,Sydney,2017-03-13
122006,17.7,23.8,40.8,2.2,8.7,39.0,Perth,2012-04-29


In [13]:
df_mini.shape

(5000, 8)
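
The conversion above goes through an intermediate DataFrame. A shorter route, sketched here on a made-up two-row frame (the name `toy` is an assumption for illustration), is to assign the result of `pd.to_datetime` straight to the new column.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Date": ["2009-05-14", "2017-03-13"], "MaxTemp": [24.5, 27.4]})

# pd.to_datetime returns a Series aligned with toy's index, so it can be assigned directly.
toy["Datetime"] = pd.to_datetime(toy["Date"])
toy = toy.drop("Date", axis=1)

print(toy.dtypes)
print(toy["Datetime"].dt.year.tolist())
```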

## Graphs from Altair

In [14]:
sel = alt.selection_single(fields=["Location"], bind="legend")
c1 = alt.Chart(df_mini).mark_circle().encode(
    x= "Datetime",
    y= "MaxTemp",
    color=alt.condition(sel,"Location", alt.value("grey")),
    opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip=["Location","Datetime","MaxTemp"]
).properties(
    title='Max Temp Data'
).add_selection(sel)
c1


The two charts below are interactive: scroll with your mouse to zoom in and out, and left-click and drag to move around.

In [15]:
sel = alt.selection_single(fields=["Location"], bind="legend")

c2 = alt.Chart(df_mini).mark_line().encode(
    x= "Datetime",
    y= "MaxTemp",
    color=alt.condition(sel,"Location", alt.value("grey")),
    opacity=alt.condition(sel, alt.value(0.65), alt.value(0.1)),
    tooltip=["Location","Datetime","MaxTemp"]
).add_selection(sel).interactive()
c2

In [16]:
c1+c2

For these graphs I focused on how "MaxTemp" changes over time with respect to "Location". Looking at the graphs, I noticed that some cities have long horizontal lines caused by a lack of data. I talk more about this below in the caption for another graph, but to summarize, this is a result of how I cleaned the data with dropna().

There is also a clear pattern in these graphs. Although there isn't a positive or negative trend over the course of the time frame, the points zig-zag in a way that seems to correspond to the month/season.


Below I make a smaller dataframe consisting of only the rows which are in the year 2014. I do this because I eventually want to see how the columns affect "MaxTemp" within one year.

In [17]:
df_mini = df_mini[(df_mini["Datetime"].dt.year==2014)].copy()

In [18]:
sel = alt.selection_single(fields=["Location"], bind="legend")
c2 = alt.Chart(df_mini).mark_circle().encode(
    x= "Datetime",
    y= "MaxTemp", 
    color=alt.condition(sel, "Location", alt.value("grey")),
    opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip=["Datetime"]
).add_selection(sel)
c2

The graph's points are widely spread out, but they form a dip similar to a flattened-out x^2 graph.

Next I create a new column "Month" in df_mini, which gives the month as a number.

In [19]:
df_mini["Month"]=df_mini["Datetime"].dt.month.values.copy()
df_mini
#method 2 to get series values into DataFrame

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Location,Datetime,Month
89199,13.3,26.2,0.0,6.0,9.9,33.0,Cairns,2014-08-20,8
141116,25.3,32.9,0.0,2.8,10.0,35.0,Darwin,2014-03-26,3
89177,18.1,26.6,0.8,1.2,10.4,50.0,Cairns,2014-07-29,7
122796,12.6,20.7,0.0,3.0,8.9,37.0,Perth,2014-08-26,8
62941,8.4,26.6,0.0,7.4,12.2,39.0,Sale,2014-01-23,1
...,...,...,...,...,...,...,...,...,...
141361,24.1,33.4,13.8,6.6,5.9,61.0,Darwin,2014-11-26,11
88987,23.2,30.9,0.0,5.4,8.9,28.0,Cairns,2014-01-20,1
131988,3.2,12.5,0.0,0.8,6.6,22.0,Hobart,2014-08-18,8
141295,23.3,32.7,0.0,9.2,11.0,35.0,Darwin,2014-09-21,9


In the following steps, I use groupby to get the averages for each month for columns "MaxTemp" and "MinTemp".

In [20]:
df_mon = df_mini.groupby("Month")[["MaxTemp","MinTemp"]].mean() #select the two numeric columns before averaging
df_mon


Month,MaxTemp,MinTemp
1,31.048077,19.7
2,30.76,20.035556
3,28.877778,19.051111
4,26.363462,16.65
5,24.345946,14.932432
6,21.52449,12.718367
7,20.87234,10.774468
8,21.805882,11.105882
9,24.478182,13.316364
10,27.425424,16.874576


In [21]:
print(f''' 
min MaxTemp: {min(df_mon["MaxTemp"])},  month: {df_mon["MaxTemp"].idxmin()}
min MinTemp: {min(df_mon["MinTemp"])},  month: {df_mon["MinTemp"].idxmin()}
''')

 
min MaxTemp: 20.872340425531913,  month: 7
min MinTemp: 10.774468085106381,  month: 7



In the above steps, by using groupby on the "Month" column of df_mini, we can see that the monthly averages of both "MaxTemp" and "MinTemp" are lowest in month 7, which is July. I use idxmin() here because it returns the index label (the month number), whereas argmin() returns the zero-based position. July is a winter month in Australia, so this makes sense: winter is typically the coldest season.
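
The difference between a positional and a label-based lookup of the minimum matters here, because the index of df_mon holds month numbers rather than positions. The sketch below uses a made-up Series of monthly averages (the name `monthly` and its values are assumptions) to show what each method returns.

```python
import pandas as pd

# Made-up monthly averages, indexed by month number (illustrative only).
monthly = pd.Series(
    [31.0, 24.3, 20.9, 27.4],
    index=pd.Index([1, 5, 7, 10], name="Month"),
)

pos = monthly.argmin()    # zero-based position of the smallest value
label = monthly.idxmin()  # index label (the month number) of the smallest value

print(pos, label)
```

Here the smallest value sits at position 2 but belongs to month 7, so only `idxmin()` answers the question "which month is coldest?".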

In [22]:
alt.Chart(df_mini,title="2014 Max Temperature (C) in Australian Cities").mark_rect().encode(
    x="Month:O",
    y="Location:O",
    color=alt.Color('MaxTemp:Q', scale=alt.Scale(scheme="inferno")),
    tooltip=[alt.Tooltip('MaxTemp:Q', title='Max Temp')]
).properties(width=550)

Note that the colors correspond to "Max Temp" in degrees Celsius. The darker colors indicate a cooler temperature, and the warmer, yellow colors indicate that it is hot.

This colorful graph does more than just look pretty. It not only displays the temperatures of cities for each month in 2014, but it also tells a story about the data that was used. In the graph, there's missing data for the cities Sale and Hobart. This comes from removing the rows with NaN values at the beginning of my project. From April to December, Sale has NaN values in columns such as "Evaporation", so those rows were cut out of the df_mini dataset when I used dropna. Hobart has similar gaps from January to April.
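
The gaps in the heatmap can also be checked numerically by counting rows per city and month. This is a minimal sketch on made-up rows (the name `toy` and its values are assumptions, not the project's data) using `pd.crosstab`; a count of 0 marks a city/month combination with no data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Location": ["Sale", "Sale", "Hobart", "Darwin", "Darwin", "Darwin"],
    "Month": [1, 2, 5, 1, 2, 5],
    "MaxTemp": [26.6, 27.0, 14.0, 32.9, 33.4, 32.0],
})

# Rows are cities, columns are months, entries are the number of observations.
counts = pd.crosstab(toy["Location"], toy["Month"])

# True wherever a city has no rows for that month.
missing = counts.eq(0)

print(counts)
print(missing)
```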


Interpreting this map, we can also see that Hobart is most likely the coldest of the cities, since it has the lowest max temperatures (C) overall. However, Hobart's picture is incomplete because it is missing data from the summer. It is the southernmost of these cities (it is in Tasmania), so presumably it is cooler than the other cities throughout the year. Darwin is most likely the hottest, as its max temperature never drops below 27.3 degrees Celsius.
 
[Reference](https://altair-viz.github.io/gallery/weather_heatmap.html): Heatmap from Altair

## Standard Scaler


I create a list, using a list comprehension, of all the column names that have numeric dtypes.
I then remove "MaxTemp" from the list because I want to predict "MaxTemp" from the other columns.

In [23]:
cols = [c for c in df_mini.columns if is_numeric_dtype(df_mini[c])]
cols.remove("MaxTemp")
cols

['MinTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'Month']

In [24]:
df_mini2 = df_mini.copy()

I use StandardScaler() to fit and then transform df_mini[cols] so that each of the columns has mean 0 and standard deviation 1.

In [25]:
scaler = StandardScaler() # mean=0 and std=1

In [26]:
scaler.fit(df_mini[cols])

In [27]:
df_mini2[cols] = scaler.transform(df_mini[cols])
df_mini2

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Location,Datetime,Month
89199,-0.442614,26.2,-0.294668,-0.028919,0.490930,-0.583096,Cairns,2014-08-20,0.382545
141116,1.448493,32.9,-0.294668,-0.765512,0.519041,-0.419105,Darwin,2014-03-26,-1.049583
89177,0.313829,26.6,-0.198020,-1.133809,0.631487,0.810832,Cairns,2014-07-29,0.096119
122796,-0.552928,20.7,-0.294668,-0.719475,0.209816,-0.255113,Perth,2014-08-26,0.382545
62941,-1.214816,26.6,-0.294668,0.293341,1.137492,-0.091122,Sale,2014-01-23,-1.622434
...,...,...,...,...,...,...,...,...,...
141361,1.259383,33.4,1.372494,0.109193,-0.633525,1.712785,Darwin,2014-11-26,1.241821
88987,1.117550,30.9,-0.294668,-0.167030,0.209816,-0.993075,Cairns,2014-01-20,-1.622434
131988,-2.034295,12.5,-0.294668,-1.225883,-0.436746,-1.485050,Hobart,2014-08-18,0.382545
141295,1.133309,32.7,-0.294668,0.707675,0.800155,-0.419105,Darwin,2014-09-21,0.668970


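As seen below, for each column the mean is near 0 and the std is near 1. The std is slightly above 1 because pandas' std() uses the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0).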

In [28]:
df_mini2[cols].mean()

MinTemp         -1.198217e-16
Rainfall        -2.695988e-17
Evaporation      4.193760e-17
Sunshine        -2.516256e-16
WindGustSpeed    1.977058e-16
Month           -8.387520e-17
dtype: float64

In [29]:
df_mini2[cols].std()

MinTemp          1.000844
Rainfall         1.000844
Evaporation      1.000844
Sunshine         1.000844
WindGustSpeed    1.000844
Month            1.000844
dtype: float64
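
The small excess over 1 in the std values comes from the two different definitions of standard deviation. The sketch below, on a made-up four-row frame (the name `toy` and its values are assumptions), shows that `std(ddof=0)` gives 1 after scaling, while the pandas default `ddof=1` gives sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only.
toy = pd.DataFrame({
    "MinTemp": [9.0, 22.2, 17.7, 13.3],
    "Sunshine": [9.5, 6.2, 8.7, 9.9],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

n = len(toy)
pop_std = scaled.std(ddof=0)   # population std, the definition StandardScaler uses
sample_std = scaled.std()      # pandas default is ddof=1 (sample std)

print(pop_std)
print(sample_std, np.sqrt(n / (n - 1)))
```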

## Linear Regression
Next we use LinearRegression(). We fit then predict. 

In [30]:
reg = LinearRegression()

In [31]:
reg.fit(df_mini2[cols], df_mini[["MaxTemp"]])

I set a new column in df_mini2 called "pred" to the LinearRegression predictions for df_mini2[cols].

In [32]:
df_mini2["pred"] = reg.predict(df_mini2[cols])
df_mini2[:3]

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Location,Datetime,Month,pred
89199,-0.442614,26.2,-0.294668,-0.028919,0.49093,-0.583096,Cairns,2014-08-20,0.382545,25.111805
141116,1.448493,32.9,-0.294668,-0.765512,0.519041,-0.419105,Darwin,2014-03-26,-1.049583,34.467149
89177,0.313829,26.6,-0.19802,-1.133809,0.631487,0.810832,Cairns,2014-07-29,0.096119,28.06405


Below, I graph both the prediction and the "MaxTemp". It looks similar to the graph for "MaxTemp".

In [33]:
c3 = alt.Chart(df_mini2).mark_line().encode(
    x= "Datetime",
    y= "pred", 
    tooltip=["Datetime"]
)

In [34]:
c4 = alt.Chart(df_mini2).mark_line().encode(
    x= "Datetime",
    y= "MaxTemp", 
    tooltip=["Datetime"])
c3|c4

## Pipeline
Pipeline is a more compact way of combining StandardScaler() with any type of regression. It requires a lot less code.

In [35]:
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("reg", LinearRegression())
    ]
)

In [36]:
pipe.fit(df_mini[cols],df_mini["MaxTemp"])

In [37]:
pipe.predict(df_mini[cols])

array([25.11180492, 34.4671491 , 28.06404967, 23.24388546, 24.60659851,
       25.63241793, 28.47856407, 31.25363106, 27.95678221, 23.28560335,
       16.99981648, 30.66529502, 26.41665912, 21.35954964, 32.44412481,
       30.99253536, 23.88620124, 13.1172668 , 33.92676117, 34.1584419 ,
       23.15595835, 26.38314459, 19.29818682, 25.69450172, 17.55670647,
       14.71606152, 25.82657478, 23.22298781, 22.78431273, 27.17429432,
       24.60619193, 18.96278159, 28.01657322, 35.8924847 , 30.51995875,
       22.49907529, 31.51448768, 32.21935974, 32.7180799 , 25.91273469,
       26.33792628, 26.12468334, 32.36652342, 32.29533213, 28.89832845,
       30.11857077, 12.50780271, 20.07715154, 23.33436236, 17.48145606,
       27.44350297, 26.72049109, 15.93125125, 34.66040528, 20.37139474,
       14.48894165, 19.62557181, 22.0144519 , 29.55341072, 18.4840181 ,
       24.09524684, 32.49921767, 34.61680769, 27.4543947 , 24.46048158,
       32.21068784, 28.73042989, 17.69850502, 31.54618215, 24.31

Three cells of code are all that is needed to use Pipeline.
Below, I inserted the predicted values into a column called "pred2" in df_mini2. 

In [38]:
df_mini2["pred2"]=pipe.predict(df_mini[cols])

Also, in the following line we can see that the "pred" column has all the same values as the "pred2" column in df_mini2. This confirms that Pipeline did the same thing as applying StandardScaler first and then LinearRegression.

In [39]:
(df_mini2["pred"]==df_mini2["pred2"]).all()

True
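
The exact `==` comparison works here because both routes perform the same floating-point operations. In general, two computations that are mathematically equal can differ in the last few digits, so a tolerance-based check such as `np.allclose` is the safer habit. A minimal sketch with made-up numbers (the arrays `a` and `b` are assumptions for illustration):

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in floating point.
a = np.array([0.1 + 0.2, 25.111805])
b = np.array([0.3, 25.111805])

exact = (a == b).all()     # strict element-wise equality
close = np.allclose(a, b)  # equality up to a small tolerance

print(exact, close)
```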

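The coefficients and intercept are listed below. Because the inputs were standardized, each coefficient is the change in predicted "MaxTemp" for a one-standard-deviation increase in that column. "MinTemp", "Sunshine", and "Evaporation" all have positive coefficients, so later I fit a model on just these three columns and compare it with the model that uses all of cols.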

In [40]:
reg.coef_

array([[ 4.77635062, -0.1989574 ,  1.01691609,  2.24238717, -0.02565457,
        -0.70732408]])

In [41]:
reg.coef_.shape

(1, 6)

In [42]:
pd.Series(reg.coef_.reshape(-1), index=reg.feature_names_in_)

MinTemp          4.776351
Rainfall        -0.198957
Evaporation      1.016916
Sunshine         2.242387
WindGustSpeed   -0.025655
Month           -0.707324
dtype: float64

In [43]:
reg.intercept_

array([26.35143339])
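
The coefficients above are in standardized units. To express them per original unit (for example, degrees of "MaxTemp" per degree of "MinTemp"), each coefficient can be divided by the scaler's `scale_`, and the intercept adjusted with `mean_`. The sketch below checks this on synthetic data (the data, seed, and variable names are assumptions, not the weather data) by comparing against a LinearRegression fit on the unscaled inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=[15.0, 8.0], scale=[6.0, 3.0], size=(100, 2))
y = 10 + 0.8 * X[:, 0] + 0.7 * X[:, 1] + rng.normal(scale=1.0, size=100)

scaler = StandardScaler().fit(X)
reg_scaled = LinearRegression().fit(scaler.transform(X), y)

# z = (x - mean) / scale, so the slope per original unit is coef / scale.
slopes = reg_scaled.coef_ / scaler.scale_
intercept = reg_scaled.intercept_ - np.sum(reg_scaled.coef_ * scaler.mean_ / scaler.scale_)

# Fitting directly on the unscaled inputs should give the same line.
direct = LinearRegression().fit(X, y)

print(slopes, direct.coef_)
print(intercept, direct.intercept_)
```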

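The score tells us how well the model predicts. For LinearRegression it is the coefficient of determination R^2, so the closer to 1, the better. Note that it is computed here on the same data the model was fit on.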

In [44]:
pipe.score(df_mini[cols],df_mini["MaxTemp"])

0.7997639618096023
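
Because the score above is computed on the data used for fitting, it can be optimistic. A held-out split gives a more honest estimate. The sketch below uses synthetic data (an assumption, not the weather data) together with train_test_split, which is already imported at the top of the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 20 + 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=2.0, size=200)

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe_split = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])
pipe_split.fit(X_train, y_train)

train_score = pipe_split.score(X_train, y_train)  # R^2 on the data used for fitting
test_score = pipe_split.score(X_test, y_test)     # R^2 on unseen data

print(train_score, test_score)
```

A test score far below the training score would be a sign of overfitting.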

## PoissonRegressor
I use PoissonRegressor() to predict "MaxTemp", first with the cols list and then with just "MinTemp", "Sunshine", and "Evaporation".

In [45]:
from sklearn.linear_model import PoissonRegressor

In [46]:
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pois", PoissonRegressor())
    ]
)

In [47]:
pipe.fit(df_mini[cols],df_mini["MaxTemp"])

In [48]:
df_mini2["pred3"] = pipe.predict(df_mini[cols])
df_mini2

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Location,Datetime,Month,pred,pred2,pred3
89199,-0.442614,26.2,-0.294668,-0.028919,0.490930,-0.583096,Cairns,2014-08-20,0.382545,25.111805,25.111805,24.503057
141116,1.448493,32.9,-0.294668,-0.765512,0.519041,-0.419105,Darwin,2014-03-26,-1.049583,34.467149,34.467149,35.065473
89177,0.313829,26.6,-0.198020,-1.133809,0.631487,0.810832,Cairns,2014-07-29,0.096119,28.064050,28.064050,27.499847
122796,-0.552928,20.7,-0.294668,-0.719475,0.209816,-0.255113,Perth,2014-08-26,0.382545,23.243885,23.243885,22.833752
62941,-1.214816,26.6,-0.294668,0.293341,1.137492,-0.091122,Sale,2014-01-23,-1.622434,24.606599,24.606599,23.955344
...,...,...,...,...,...,...,...,...,...,...,...,...
141361,1.259383,33.4,1.372494,0.109193,-0.633525,1.712785,Darwin,2014-11-26,1.241821,29.861739,29.861739,29.567048
88987,1.117550,30.9,-0.294668,-0.167030,0.209816,-0.993075,Cairns,2014-01-20,-1.622434,33.221566,33.221566,33.371385
131988,-2.034295,12.5,-0.294668,-1.225883,-0.436746,-1.485050,Hobart,2014-08-18,0.382545,14.235093,14.235093,16.150914
141295,1.133309,32.7,-0.294668,0.707675,0.800155,-0.419105,Darwin,2014-09-21,0.668970,33.874617,33.874617,34.298310


In [49]:
pipe.score(df_mini[cols],df_mini["MaxTemp"])

0.7688083582478654

Above, you can see that the score is lower when using Poisson Regressor than when I used Linear Regression.

Below, I use fewer columns to see whether that affects the predictions and the score.

In [50]:
pipe.fit(df_mini[["MinTemp","Evaporation","Sunshine"]],df_mini["MaxTemp"])

In [51]:
pipe.predict(df_mini[["MinTemp","Evaporation","Sunshine"]])

array([24.66571687, 34.20879157, 27.61003155, 23.02854555, 22.82376875,
       25.9498588 , 27.62548724, 30.86231634, 27.49623567, 22.56935588,
       18.17665532, 29.31108894, 26.72477169, 20.2894126 , 33.94097311,
       29.19967209, 23.86267716, 15.6165626 , 33.6610098 , 36.03452942,
       22.12359397, 24.41929994, 19.47195449, 25.30524483, 17.55403011,
       16.48318184, 25.45283019, 23.2471915 , 22.27727317, 25.9436926 ,
       24.73608438, 19.68028036, 28.20024264, 35.52535358, 29.35059837,
       22.68976816, 33.52483839, 30.65022025, 33.64123955, 25.38893588,
       25.70847601, 26.03312007, 31.36098189, 31.87701812, 28.25068994,
       29.18948605, 15.14589709, 21.321386  , 23.61190657, 18.00670287,
       27.45958235, 26.76752194, 17.20734116, 34.63270217, 20.79638375,
       16.52998143, 19.3793249 , 21.91294448, 27.65877381, 19.10171201,
       23.90376461, 31.52739808, 33.80621809, 27.55904414, 24.59963112,
       33.26375341, 28.93899261, 18.60671764, 32.66481136, 23.19

In [52]:
pipe.score(df_mini[["MinTemp","Evaporation","Sunshine"]],df_mini["MaxTemp"])

0.7573496399072459

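With the same columns, the score from Poisson Regression (0.769) is about 0.03 lower than the score from Linear Regression (0.800), which suggests that Linear Regression is the better model for this data. The two scores are not measured in quite the same way, though: LinearRegression's score is R^2, while PoissonRegressor's score is D^2, the fraction of Poisson deviance explained, so the comparison is only approximate. One likely reason for the lower score is that Poisson Regression assumes the variance of the target equals its mean, an assumption designed for count data rather than temperatures. StandardScaler rescales only the input columns, not "MaxTemp", so it does not affect this assumption. Another likely reason is that PoissonRegressor applies regularization by default (alpha=1.0).

Also, when using fewer columns for the training data, the score goes down (0.757 versus 0.769). This makes sense because dropping columns removes information that the model could otherwise use.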
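
To compare a linear and a Poisson model on the same footing, one option is to compute R^2 for both with `r2_score` instead of relying on each model's own `score` method. This is a sketch on synthetic, strictly positive data (the data and variable names are assumptions, not the weather data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only; clipped so the target stays positive,
# which PoissonRegressor requires.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.clip(25 + 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=2.0, size=300), 0.1, None)

lin = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())]).fit(X, y)
pois = Pipeline([("scaler", StandardScaler()), ("pois", PoissonRegressor())]).fit(X, y)

r2_lin = r2_score(y, lin.predict(X))    # same value as lin.score(X, y)
r2_pois = r2_score(y, pois.predict(X))  # R^2 of the Poisson predictions
d2_pois = pois.score(X, y)              # D^2, what PoissonRegressor.score reports

print(r2_lin, r2_pois, d2_pois)
```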

[Reference](https://scikit-learn.org/stable/auto_examples/linear_model/plot_poisson_regression_non_normal_loss.html)

## Lasso
Next I try Lasso to see if it works better.

In [53]:
from sklearn.linear_model import Lasso

In [54]:
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("lasso", Lasso())
    ]
)

In [55]:
pipe.fit(df_mini[cols],df_mini["MaxTemp"])

In [56]:
df_mini2["pred4"] = pipe.predict(df_mini[cols])
df_mini2

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,Location,Datetime,Month,pred,pred2,pred3,pred4
89199,-0.442614,26.2,-0.294668,-0.028919,0.490930,-0.583096,Cairns,2014-08-20,0.382545,25.111805,25.111805,24.503057,25.237897
141116,1.448493,32.9,-0.294668,-0.765512,0.519041,-0.419105,Darwin,2014-03-26,-1.049583,34.467149,34.467149,35.065473,32.440049
89177,0.313829,26.6,-0.198020,-1.133809,0.631487,0.810832,Cairns,2014-07-29,0.096119,28.064050,28.064050,27.499847,27.872262
122796,-0.552928,20.7,-0.294668,-0.719475,0.209816,-0.255113,Perth,2014-08-26,0.382545,23.243885,23.243885,22.833752,24.054934
62941,-1.214816,26.6,-0.294668,0.293341,1.137492,-0.091122,Sale,2014-01-23,-1.622434,24.606599,24.606599,23.955344,23.203801
...,...,...,...,...,...,...,...,...,...,...,...,...,...
141361,1.259383,33.4,1.372494,0.109193,-0.633525,1.712785,Darwin,2014-11-26,1.241821,29.861739,29.861739,29.567048,30.572726
88987,1.117550,30.9,-0.294668,-0.167030,0.209816,-0.993075,Cairns,2014-01-20,-1.622434,33.221566,33.221566,33.371385,31.010369
131988,-2.034295,12.5,-0.294668,-1.225883,-0.436746,-1.485050,Hobart,2014-08-18,0.382545,14.235093,14.235093,16.150914,16.998344
141295,1.133309,32.7,-0.294668,0.707675,0.800155,-0.419105,Darwin,2014-09-21,0.668970,33.874617,33.874617,34.298310,32.332803


In [57]:
pipe.score(df_mini[cols],df_mini["MaxTemp"])

0.7388354956518769

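Out of all the models I tried, Lasso had the lowest score for predicting "MaxTemp" from the columns of df_mini that are in the list cols. Lasso() uses a penalty of alpha=1.0 by default, which shrinks the coefficients and lowers the training score relative to plain Linear Regression.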
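
How much Lasso loses depends on its penalty strength `alpha`. The sketch below, on synthetic data (an assumption, not the weather data), fits the same pipeline with a few `alpha` values and compares the training scores with plain LinearRegression.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 25 + 5 * X[:, 0] + 2 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=2.0, size=300)

# Training R^2 for several penalty strengths; 1.0 is the Lasso default.
scores = {}
for alpha in [1.0, 0.1, 0.01]:
    lasso_pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso(alpha=alpha))])
    scores[alpha] = lasso_pipe.fit(X, y).score(X, y)

ols = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())]).fit(X, y)
ols_score = ols.score(X, y)

print(scores, ols_score)
```

On training data the Lasso score approaches the LinearRegression score as `alpha` shrinks; the benefit of the penalty, if any, would only show up on held-out data.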

[Reference](https://scikit-learn.org/stable/modules/linear_model.html#lasso)

## Summary

In the Altair section, I displayed charts of "MaxTemp" in relation to time ("Datetime") and the cities ("Location"). In the machine learning section, I used Standard Scaler, Pipeline, Linear Regression, Poisson Regressor, and Lasso to predict "MaxTemp". I showed that for my data, Linear Regression worked best and that using more columns improved the predictions and the score.

## References


* Source of the dataset: [Reference](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package)




