# Working on weather data for a project

[Citrics](https://b.citrics.dev/) is a project that helps people make a decision before moving to a new city by providing useful information about different cities. One of its core features is retrieving weather information for different cities and comparing them. This notebook shows how the data was cleaned and wrangled, and how new features were created so that they can be used for weather insights.

The data was collected from [World Weather Online](https://www.worldweatheronline.com/). Data for each city was collected separately and then joined together.

This notebook shows how precipitation and snow data were retrieved. To see how the average temperature and humidity data were retrieved, click [here](https://colab.research.google.com/drive/1mgsddcrdNcRMAy2o95ifUuZSgcMrGfk0?usp=sharing).

In [None]:
import pandas as pd

An example of how the data was manipulated is given below.

In [None]:
df = pd.read_csv('akron.csv')
df.head()

Unnamed: 0,date_time,maxtempC,mintempC,totalSnow_cm,sunHour,uvIndex,moon_illumination,moonrise,moonset,sunrise,sunset,DewPointC,FeelsLikeC,HeatIndexC,WindChillC,WindGustKmph,cloudcover,humidity,precipMM,pressure,tempC,visibility,winddirDegree,windspeedKmph,location
0,2019-01-01 00:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,10,4,6,4,57,69,84,2.5,1002,4,8,225,27,akron
1,2019-01-01 01:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,9,5,7,5,56,79,84,1.3,1004,5,8,235,27,akron
2,2019-01-01 02:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,8,7,8,7,54,90,83,0.9,1006,7,8,245,27,akron
3,2019-01-01 03:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,6,8,9,8,53,100,83,1.2,1008,9,8,255,27,akron
4,2019-01-01 04:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,5,6,8,6,48,100,83,0.6,1010,7,9,265,25,akron


In [None]:
# Getting the 'day', 'month' and 'year' columns from 'date_time'

df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['day'] = df['date_time'].dt.day
df.head()

Unnamed: 0,date_time,maxtempC,mintempC,totalSnow_cm,sunHour,uvIndex,moon_illumination,moonrise,moonset,sunrise,sunset,DewPointC,FeelsLikeC,HeatIndexC,WindChillC,WindGustKmph,cloudcover,humidity,precipMM,pressure,tempC,visibility,winddirDegree,windspeedKmph,location,year,month,day
0,2019-01-01 00:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,10,4,6,4,57,69,84,2.5,1002,4,8,225,27,akron,2019,1,1
1,2019-01-01 01:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,9,5,7,5,56,79,84,1.3,1004,5,8,235,27,akron,2019,1,1
2,2019-01-01 02:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,8,7,8,7,54,90,83,0.9,1006,7,8,245,27,akron,2019,1,1
3,2019-01-01 03:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,6,8,9,8,53,100,83,1.2,1008,9,8,255,27,akron,2019,1,1
4,2019-01-01 04:00:00,5,2,0.9,4.8,1,26,04:31 AM,03:17 PM,08:51 AM,06:08 PM,5,6,8,6,48,100,83,0.6,1010,7,9,265,25,akron,2019,1,1
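
The `errors='coerce'` argument used above matters when a file contains malformed timestamps. The following is a minimal sketch with made-up values (not taken from `akron.csv`) showing that an unparseable entry becomes `NaT` instead of raising an error, so such rows can be counted or dropped afterwards.

```python
import pandas as pd

# Hypothetical sample values; the second one is deliberately malformed.
sample = pd.DataFrame({"date_time": ["2019-01-01 00:00:00", "not a date"]})

# Unparseable strings are turned into NaT rather than raising an exception.
sample["date_time"] = pd.to_datetime(sample["date_time"], errors="coerce")
sample["year"] = sample["date_time"].dt.year

# Count the rows whose timestamp could not be parsed.
n_bad = int(sample["date_time"].isna().sum())
print(n_bad)
```

Rows with `NaT` also get a missing `year`, `month` and `day`, so it may be worth checking `n_bad` before aggregating.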


In [None]:
# Shrinking the dataframe to keep necessary columns only

df = df[['location', 'date_time', 'year', 'month', 'day', 'totalSnow_cm', 'precipMM']]
df.head()

Unnamed: 0,location,date_time,year,month,day,totalSnow_cm,precipMM
0,akron,2019-01-01 00:00:00,2019,1,1,0.9,2.5
1,akron,2019-01-01 01:00:00,2019,1,1,0.9,1.3
2,akron,2019-01-01 02:00:00,2019,1,1,0.9,0.9
3,akron,2019-01-01 03:00:00,2019,1,1,0.9,1.2
4,akron,2019-01-01 04:00:00,2019,1,1,0.9,0.6


In [None]:
df1 = df[['location', 'year']]
df1.head()

Unnamed: 0,location,year
0,akron,2019
1,akron,2019
2,akron,2019
3,akron,2019
4,akron,2019


In [None]:
# Getting the average precipitation and snow per day of the year (2019).
# numeric_only=True skips the text column 'location', which cannot be averaged.

df = df.groupby(pd.Grouper(freq='D', key='date_time')).mean(numeric_only=True)
df.head()

date_time,year,month,day,totalSnow_cm,precipMM
2019-01-01,2019,1,1,0.9,0.316667
2019-01-02,2019,1,2,0.3,0.004167
2019-01-03,2019,1,3,0.2,0.0125
2019-01-04,2019,1,4,0.0,0.0
2019-01-05,2019,1,5,0.0,0.0
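
To make the daily grouping easier to follow, here is a small sketch on synthetic hourly data (all values are invented for illustration). It assumes the same column names as above and shows that 48 hourly rows collapse into two daily rows, and that the daily mean of `precipMM` is zero only when every hour of that day was dry.

```python
import pandas as pd

# Synthetic hourly readings for two days; all values are invented.
stamps = pd.Timestamp("2019-01-01") + pd.to_timedelta(list(range(48)), unit="h")
hourly = pd.DataFrame({
    "location": "akron",
    "date_time": stamps,
    "precipMM": [1.2] + [0.0] * 47,  # rain in a single hour of day one
})

# Group the hourly rows into calendar days and average the numeric columns.
# numeric_only=True skips the text column 'location'.
daily = hourly.groupby(pd.Grouper(freq="D", key="date_time")).mean(numeric_only=True)
print(daily)
```

Because a single wet hour is enough to make the daily mean non-zero, the mean can later be used to flag whether it rained at all on a given day.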


In [None]:
df = df.rename(columns={"totalSnow_cm": "snowed", "precipMM": "rained"})
df.head()

date_time,year,month,day,snowed,rained
2019-01-01,2019,1,1,0.9,0.316667
2019-01-02,2019,1,2,0.3,0.004167
2019-01-03,2019,1,3,0.2,0.0125
2019-01-04,2019,1,4,0.0,0.0
2019-01-05,2019,1,5,0.0,0.0


In [None]:
# Turning the 'snowed' and 'rained' columns into 0/1 flags so that
# we can easily determine on which days it snowed and/or rained

df['snowed'] = df['snowed'].where(df['snowed'] == 0, 1).astype(int)
df['rained'] = df['rained'].where(df['rained'] == 0, 1).astype(int)


df.head()

date_time,year,month,day,snowed,rained
2019-01-01,2019,1,1,1,1
2019-01-02,2019,1,2,1,1
2019-01-03,2019,1,3,1,1
2019-01-04,2019,1,4,0,0
2019-01-05,2019,1,5,0,0
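
As an aside, the same 0/1 flags can be produced with a direct comparison. The sketch below is an alternative, not the code used in this notebook, and the sample values are illustrative only.

```python
import pandas as pd

# Illustrative daily averages; not the real Akron values.
daily = pd.DataFrame({
    "snowed": [0.9, 0.3, 0.0],
    "rained": [0.316667, 0.0, 0.0],
})

# Any value above zero becomes 1, everything else becomes 0.
flags = (daily[["snowed", "rained"]] > 0).astype(int)
print(flags)
```

One difference to keep in mind: with the `where(... == 0, 1)` approach a missing value ends up as 1, whereas the comparison above turns it into 0.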


In [None]:
# Getting the total days snowed and rained

df['total_days_snowed'] = df['snowed'].sum()
df['total_days_rained'] = df['rained'].sum()

df.head()

date_time,year,month,day,snowed,rained,total_days_snowed,total_days_rained
2019-01-01,2019,1,1,1,1,64,253
2019-01-02,2019,1,2,1,1,64,253
2019-01-03,2019,1,3,1,1,64,253
2019-01-04,2019,1,4,0,0,64,253
2019-01-05,2019,1,5,0,0,64,253


In [None]:
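# Merging the two dataframes to get the 'location' column back.
# df1 has one row per hour, so duplicates are dropped first to avoid
# pairing every hourly row with every daily row.

df = pd.merge(df1.drop_duplicates(), df, how='outer', on='year')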


df.head()

Unnamed: 0,location,year,month,day,snowed,rained,total_days_snowed,total_days_rained
0,akron,2019,1,1,1,1,64,253
1,akron,2019,1,2,1,1,64,253
2,akron,2019,1,3,1,1,64,253
3,akron,2019,1,4,0,0,64,253
4,akron,2019,1,5,0,0,64,253
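
A note on how this merge behaves: the two frames share only the `year` column, so `pd.merge` joins on it and pairs every row on one side with every matching row on the other. The toy sketch below (invented frames, hypothetical variable names) shows how the row count grows, and how deduplicating the lookup frame first keeps one row per day.

```python
import pandas as pd

# Toy stand-ins: a lookup frame with repeated rows and a small daily frame.
lookup = pd.DataFrame({"location": ["akron"] * 3, "year": [2019] * 3})
daily = pd.DataFrame({"year": [2019, 2019], "day": [1, 2]})

# Every lookup row is paired with every daily row: 3 x 2 = 6 rows.
paired = pd.merge(lookup, daily, how="outer")

# With duplicates removed, the result has one row per day: 1 x 2 = 2 rows.
compact = pd.merge(lookup.drop_duplicates(), daily, how="outer")

print(len(paired), len(compact))
```

With a full year of hourly rows on one side, the difference in size becomes substantial, even though the first rows of the result look the same.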


In [None]:
# Keeping the necessary columns only

df = df[['location', 'year', 'total_days_snowed', 'total_days_rained']]
df.head()

Unnamed: 0,location,year,total_days_snowed,total_days_rained
0,akron,2019,64,253
1,akron,2019,64,253
2,akron,2019,64,253
3,akron,2019,64,253
4,akron,2019,64,253


In [None]:
# Shrinking the dataframe to a single row, since all rows are identical and
# one row holds everything we need

df = df.head(1)
df.head()

Unnamed: 0,location,year,total_days_snowed,total_days_rained
0,akron,2019,64,253


In [None]:
# Downloading the data


#from google.colab import files
#all.to_csv(r'all.csv', index = False)
#files.download('all.csv')

This is the data retrieval process for one city only. The same code was run inside a function (not shown here) to get data for all 100 cities that we worked on.
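
Since the original function is not shown, the following is only a rough sketch of how the steps above might be wrapped up and applied to several cities. The names `summarize_city` and `summarize_cities` are hypothetical, the input frames are synthetic, and the sketch counts wet days directly instead of going through the merge step. In practice each frame would presumably come from `pd.read_csv` on a per-city file.

```python
import pandas as pd


def summarize_city(hourly: pd.DataFrame) -> pd.DataFrame:
    """Return one row with the number of days it snowed and rained."""
    hourly = hourly.copy()
    hourly["date_time"] = pd.to_datetime(hourly["date_time"], errors="coerce")
    hourly = hourly.dropna(subset=["date_time"])  # drop unparseable timestamps

    # Daily averages of the two weather columns.
    daily = (
        hourly.groupby(pd.Grouper(freq="D", key="date_time"))[["totalSnow_cm", "precipMM"]]
        .mean()
    )
    return pd.DataFrame({
        "location": [hourly["location"].iloc[0]],
        "year": [int(hourly["date_time"].dt.year.iloc[0])],
        "total_days_snowed": [int((daily["totalSnow_cm"] > 0).sum())],
        "total_days_rained": [int((daily["precipMM"] > 0).sum())],
    })


def summarize_cities(frames) -> pd.DataFrame:
    """Stack the one-row summaries of several cities into one dataframe."""
    return pd.concat([summarize_city(f) for f in frames], ignore_index=True)


# Synthetic three-day hourly data for two made-up inputs.
stamps = pd.Timestamp("2019-01-01") + pd.to_timedelta(list(range(72)), unit="h")
city_a = pd.DataFrame({
    "location": "akron",
    "date_time": stamps,
    "totalSnow_cm": [0.9] * 24 + [0.0] * 48,  # snow on day one only
    "precipMM": [0.5] + [0.0] * 71,           # rain on day one only
})
city_b = pd.DataFrame({
    "location": "demo_city",
    "date_time": stamps,
    "totalSnow_cm": [0.0] * 72,                           # no snow
    "precipMM": [0.4] + [0.0] * 29 + [0.2] + [0.0] * 41,  # rain on days one and two
})

result = summarize_cities([city_a, city_b])
print(result)
```

Returning a one-row dataframe per city keeps the per-city logic identical to the walkthrough above and leaves the joining to a single `pd.concat` call.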

[Click here](https://colab.research.google.com/drive/1dp2r_YvLkOO9zQlBjk6ILjtySUqWNNou?usp=sharing) to see part 2