# Working on weather data for a project

[Citrics](https://b.citrics.dev/) is a project that helps people decide whether to move to a new city by giving them useful information about different cities. One of its core features is retrieving weather information for different cities and comparing it. This notebook shows how the data was cleaned and wrangled, and how new features were created, so that it can be used for weather insights.

The data was collected from [World Weather Online](https://www.worldweatheronline.com/). Data for each city was collected separately and then joined together.

In [None]:
import pandas as pd

An example of how the data was manipulated is given below.

In [None]:
df = pd.read_csv('akron.csv')
df.head()

Unnamed: 0,date_time,maxtempC,mintempC,totalSnow_cm,sunHour,uvIndex,moon_illumination,moonrise,moonset,sunrise,sunset,DewPointC,FeelsLikeC,HeatIndexC,WindChillC,WindGustKmph,cloudcover,humidity,precipMM,pressure,tempC,visibility,winddirDegree,windspeedKmph,location
0,2010-01-01,0,-7,1.5,4.8,1,100,07:27 PM,09:36 AM,08:51 AM,06:08 PM,-5,-10,-4,-10,25,100,92,1.8,1019,0,4,281,18,akron
1,2010-01-02,-8,-13,0.2,4.8,1,85,08:46 PM,10:16 AM,08:51 AM,06:09 PM,-12,-18,-10,-18,32,100,86,0.9,1023,-8,6,302,24,akron
2,2010-01-03,-8,-14,0.6,4.8,1,77,10:04 PM,10:49 AM,08:51 AM,06:10 PM,-12,-19,-11,-19,33,100,92,1.3,1022,-8,6,292,24,akron
3,2010-01-04,-5,-7,1.9,4.8,1,70,11:19 PM,11:18 AM,08:51 AM,06:11 PM,-7,-13,-6,-13,27,100,93,2.2,1017,-5,6,298,19,akron
4,2010-01-05,-5,-6,0.9,4.8,1,63,No moonrise,11:45 AM,08:51 AM,06:12 PM,-7,-11,-6,-11,21,100,93,1.1,1017,-5,5,291,15,akron


In [None]:
# Getting 'day', 'month' and 'year' columns from 'date_time'

df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['day'] = df['date_time'].dt.day
df.head()

Unnamed: 0,date_time,maxtempC,mintempC,totalSnow_cm,sunHour,uvIndex,moon_illumination,moonrise,moonset,sunrise,sunset,DewPointC,FeelsLikeC,HeatIndexC,WindChillC,WindGustKmph,cloudcover,humidity,precipMM,pressure,tempC,visibility,winddirDegree,windspeedKmph,location,year,month,day
0,2010-01-01,0,-7,1.5,4.8,1,100,07:27 PM,09:36 AM,08:51 AM,06:08 PM,-5,-10,-4,-10,25,100,92,1.8,1019,0,4,281,18,akron,2010,1,1
1,2010-01-02,-8,-13,0.2,4.8,1,85,08:46 PM,10:16 AM,08:51 AM,06:09 PM,-12,-18,-10,-18,32,100,86,0.9,1023,-8,6,302,24,akron,2010,1,2
2,2010-01-03,-8,-14,0.6,4.8,1,77,10:04 PM,10:49 AM,08:51 AM,06:10 PM,-12,-19,-11,-19,33,100,92,1.3,1022,-8,6,292,24,akron,2010,1,3
3,2010-01-04,-5,-7,1.9,4.8,1,70,11:19 PM,11:18 AM,08:51 AM,06:11 PM,-7,-13,-6,-13,27,100,93,2.2,1017,-5,6,298,19,akron,2010,1,4
4,2010-01-05,-5,-6,0.9,4.8,1,63,No moonrise,11:45 AM,08:51 AM,06:12 PM,-7,-11,-6,-11,21,100,93,1.1,1017,-5,5,291,15,akron,2010,1,5
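
The `errors='coerce'` argument used above changes how malformed dates are handled. The following is a minimal sketch on a made-up series (the sample values are hypothetical, not taken from the Akron file) showing that an unparseable entry becomes `NaT` rather than raising an exception.

```python
import pandas as pd

# Hypothetical sample: two valid dates and one malformed entry
raw = pd.Series(['2010-01-01', 'not a date', '2010-01-03'])

# errors='coerce' converts anything that cannot be parsed into NaT
parsed = pd.to_datetime(raw, errors='coerce')

# Flag the rows that failed to parse so they can be inspected or dropped
print(parsed.isna().tolist())

# Date parts are still available for the rows that parsed correctly
print(parsed.dt.year.iloc[0])
```

Checking `isna()` right after parsing is one way to notice silently coerced rows before they propagate into the `year`, `month` and `day` columns.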


In [None]:
# Shrinking the dataframe to keep necessary columns only

df = df[['location', 'date_time', 'year', 'month', 'day', 'maxtempC', 'mintempC', 'humidity']]
df.head()

Unnamed: 0,location,date_time,year,month,day,maxtempC,mintempC,humidity
0,akron,2010-01-01,2010,1,1,0,-7,92
1,akron,2010-01-02,2010,1,2,-8,-13,86
2,akron,2010-01-03,2010,1,3,-8,-14,92
3,akron,2010-01-04,2010,1,4,-5,-7,93
4,akron,2010-01-05,2010,1,5,-5,-6,93


In [None]:
# A function to convert temperature from Celsius to Fahrenheit

def f(x):
    x = x * 1.8 + 32
    return float(x)

In [None]:
# Applying the function

df['maxtempF'] = df['maxtempC'].apply(f)
df['mintempF'] = df['mintempC'].apply(f)
df.head()

Unnamed: 0,location,date_time,year,month,day,maxtempC,mintempC,humidity,maxtempF,mintempF
0,akron,2010-01-01,2010,1,1,0,-7,92,32.0,19.4
1,akron,2010-01-02,2010,1,2,-8,-13,86,17.6,8.6
2,akron,2010-01-03,2010,1,3,-8,-14,92,17.6,6.8
3,akron,2010-01-04,2010,1,4,-5,-7,93,23.0,19.4
4,akron,2010-01-05,2010,1,5,-5,-6,93,23.0,21.2
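
The same conversion can also be written as arithmetic on whole columns, without `apply`. This is only a sketch on a small hand-made dataframe (`temps` is a hypothetical name, with values copied from the first rows above), not the code that was run for the project.

```python
import pandas as pd

# Hypothetical miniature of the temperature columns
temps = pd.DataFrame({'maxtempC': [0, -8, -5], 'mintempC': [-7, -13, -6]})

# Arithmetic on a Series is applied element-wise, so no explicit function is needed
for col in ['maxtempC', 'mintempC']:
    # 'maxtempC' -> 'maxtempF', 'mintempC' -> 'mintempF'
    temps[col.replace('C', 'F')] = temps[col] * 1.8 + 32

print(temps)
```

On a dataframe of this size the difference is negligible, but the column-wise form is usually faster on large data and keeps the formula visible at the point of use.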


In [None]:
# A function to get a season label from a month number,
# e.g. df['month'].apply(getSeason)

def getSeason(month):
    month = int(month)
    if (month > 11 or month <= 3):
        return "WINTER"
    elif (month == 4 or month == 5):
        return "SPRING"
    elif (month >= 6 and month <= 9):
        return "SUMMER"
    else:
        return "FALL"

In [None]:
date = df.date_time.dt.month*100 + df.date_time.dt.day
df['season'] = (pd.cut(date,[0,321,620,922,1220,1300],
                       labels=['winter','spring','summer','autumn','winter '])
                  .str.strip()
               )

In [None]:
df['date_offset'] = (df.date_time.dt.month*100 + df.date_time.dt.day - 320)%1300

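# include_lowest=True keeps an offset of exactly 0 (March 20) in the first bin;
# without it that date falls outside every bin and gets NaN
df['season'] = pd.cut(df['date_offset'], [0, 300, 602, 900, 1300],
                      labels=['spring', 'summer', 'autumn', 'winter'],
                      include_lowest=True)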

In [None]:
df.head()

Unnamed: 0,location,date_time,year,month,day,maxtempC,mintempC,humidity,maxtempF,mintempF,season,date_offset
0,akron,2010-01-01,2010,1,1,0,-7,92,32.0,19.4,winter,1081
1,akron,2010-01-02,2010,1,2,-8,-13,86,17.6,8.6,winter,1082
2,akron,2010-01-03,2010,1,3,-8,-14,92,17.6,6.8,winter,1083
3,akron,2010-01-04,2010,1,4,-5,-7,93,23.0,19.4,winter,1084
4,akron,2010-01-05,2010,1,5,-5,-6,93,23.0,21.2,winter,1085
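
To make the `date_offset` arithmetic easier to follow, here is a small worked sketch on five hand-picked dates (hypothetical sample values). Each date is encoded as month*100 + day, shifted by 320 so that March 20 maps to 0, and wrapped with `% 1300` so that January to mid-March land at the end of the range. The sketch passes `include_lowest=True` so that an offset of exactly 0 falls in the first bin; that choice is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical sample dates around the season boundaries
dates = pd.to_datetime(pd.Series(['2010-01-01', '2010-03-20', '2010-06-21',
                                  '2010-09-23', '2010-12-21']))

# Encode month and day as one number (e.g. June 21 -> 621), then shift and wrap
mmdd = dates.dt.month * 100 + dates.dt.day
offset = (mmdd - 320) % 1300

# Bin the shifted values; each interval is open on the left and closed on the right
season = pd.cut(offset, [0, 300, 602, 900, 1300],
                labels=['spring', 'summer', 'autumn', 'winter'],
                include_lowest=True)

print(pd.DataFrame({'date': dates, 'offset': offset, 'season': season}))
```

January 1 gives (101 - 320) % 1300 = 1081 and June 21 gives 621 - 320 = 301, which matches the `date_offset` values in the notebook output.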


In [None]:
# Getting the dataframe for summer data and creating columns for the average temperatures and humidity

# .copy() makes 'summer' an independent dataframe, which avoids pandas' SettingWithCopyWarning
summer = df[df['season'] == 'summer'].copy()
summer['summer_maxtempF_mean'] = summer['maxtempF'].mean()
summer['summer_mintempF_mean'] = summer['mintempF'].mean()
summer['summer_humidity_mean'] = summer['humidity'].mean()
summer.head()


Unnamed: 0,location,date_time,year,month,day,maxtempC,mintempC,humidity,maxtempF,mintempF,season,date_offset,summer_maxtempF_mean,summer_mintempF_mean,summer_humidity_mean
171,akron,2010-06-21,2010,6,21,31,14,70,87.8,57.2,summer,301,79.914468,59.919149,76.843617
172,akron,2010-06-22,2010,6,22,31,19,76,87.8,66.2,summer,302,79.914468,59.919149,76.843617
173,akron,2010-06-23,2010,6,23,31,20,73,87.8,68.0,summer,303,79.914468,59.919149,76.843617
174,akron,2010-06-24,2010,6,24,27,18,80,80.6,64.4,summer,304,79.914468,59.919149,76.843617
175,akron,2010-06-25,2010,6,25,28,13,71,82.4,55.4,summer,305,79.914468,59.919149,76.843617


In [None]:
# Keeping the necessary features only

summer = summer[['location', 'summer_maxtempF_mean', 'summer_mintempF_mean', 'summer_humidity_mean']]
summer.head()

Unnamed: 0,location,summer_maxtempF_mean,summer_mintempF_mean,summer_humidity_mean
171,akron,79.914468,59.919149,76.843617
172,akron,79.914468,59.919149,76.843617
173,akron,79.914468,59.919149,76.843617
174,akron,79.914468,59.919149,76.843617
175,akron,79.914468,59.919149,76.843617


In [None]:
# Shrinking the dataframe to a single row dataframe as all the rows are the same and
# we will get what we need from one row only

summer = summer.head(1)
summer.head()

Unnamed: 0,location,summer_maxtempF_mean,summer_mintempF_mean,summer_humidity_mean
171,akron,79.914468,59.919149,76.843617


In [None]:
# Getting the dataframe for winter data and creating columns for the average temperatures and humidity

# .copy() makes 'winter' an independent dataframe, which avoids pandas' SettingWithCopyWarning
winter = df[df['season'] == 'winter'].copy()
winter['winter_maxtempF_mean'] = winter['maxtempF'].mean()
winter['winter_mintempF_mean'] = winter['mintempF'].mean()
winter['winter_humidity_mean'] = winter['humidity'].mean()
winter.head()


Unnamed: 0,location,date_time,year,month,day,maxtempC,mintempC,humidity,maxtempF,mintempF,season,date_offset,winter_maxtempF_mean,winter_mintempF_mean,winter_humidity_mean
0,akron,2010-01-01,2010,1,1,0,-7,92,32.0,19.4,winter,1081,35.767489,22.239238,85.216368
1,akron,2010-01-02,2010,1,2,-8,-13,86,17.6,8.6,winter,1082,35.767489,22.239238,85.216368
2,akron,2010-01-03,2010,1,3,-8,-14,92,17.6,6.8,winter,1083,35.767489,22.239238,85.216368
3,akron,2010-01-04,2010,1,4,-5,-7,93,23.0,19.4,winter,1084,35.767489,22.239238,85.216368
4,akron,2010-01-05,2010,1,5,-5,-6,93,23.0,21.2,winter,1085,35.767489,22.239238,85.216368


In [None]:
# Keeping the necessary features only


winter = winter[['location', 'winter_maxtempF_mean', 'winter_mintempF_mean', 'winter_humidity_mean']]
winter.head()

Unnamed: 0,location,winter_maxtempF_mean,winter_mintempF_mean,winter_humidity_mean
0,akron,35.767489,22.239238,85.216368
1,akron,35.767489,22.239238,85.216368
2,akron,35.767489,22.239238,85.216368
3,akron,35.767489,22.239238,85.216368
4,akron,35.767489,22.239238,85.216368


In [None]:
# Shrinking the dataframe to a single row dataframe as all the rows are the same and
# we will get what we need from one row only

winter = winter.head(1)
winter.head()

Unnamed: 0,location,winter_maxtempF_mean,winter_mintempF_mean,winter_humidity_mean
0,akron,35.767489,22.239238,85.216368


In [None]:
# Merging the two dataframes so that we can get both summer and winter data in one dataframe

all = pd.merge(summer, winter, on='location', how='inner')
all.head()

Unnamed: 0,location,summer_maxtempF_mean,summer_mintempF_mean,summer_humidity_mean,winter_maxtempF_mean,winter_mintempF_mean,winter_humidity_mean
0,akron,79.914468,59.919149,76.843617,35.767489,22.239238,85.216368
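
The filter, mean, `head(1)` and merge steps above can also be expressed as a single `groupby`. The sketch below is an alternative formulation, not the code that produced the output above; it runs on a tiny hand-made dataframe (`daily` and its values are hypothetical) and uses plain string season labels rather than the categorical column created by `pd.cut`.

```python
import pandas as pd

# Hypothetical miniature of the cleaned daily dataframe
daily = pd.DataFrame({
    'location': ['akron'] * 4,
    'season': ['summer', 'summer', 'winter', 'winter'],
    'maxtempF': [80.0, 90.0, 30.0, 40.0],
    'mintempF': [60.0, 70.0, 20.0, 10.0],
    'humidity': [70, 80, 90, 80],
})

# Mean of each measurement for every (city, season) pair
means = daily.groupby(['location', 'season'])[['maxtempF', 'mintempF', 'humidity']].mean()

# Move 'season' from the row index into the columns: one row per city
wide = means.unstack('season')

# Flatten the (measurement, season) column pairs into names like 'summer_maxtempF_mean'
wide.columns = [f'{season}_{measure}_mean' for measure, season in wide.columns]
wide = wide.reset_index()

print(wide)
```

The result has one row per city with the same column names as the merged dataframe, although the column order differs.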


In [None]:
# Downloading the data

from google.colab import files
all.to_csv(r'all.csv', index = False)
files.download('all.csv')


This is the data retrieval process for one city only. The same code was run in a function (not shown here) to get data for all 100 cities that we worked on.
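
Since that function is not shown, the following is only a guess at what such a wrapper might look like. The name `summarize_city`, the synthetic input data and the city name `example_city` are all hypothetical; the real project read one CSV file per city.

```python
import pandas as pd

def summarize_city(raw):
    """Hypothetical helper: one-row dataframe of summer and winter means for one city."""
    df = raw.copy()
    df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')
    # Celsius to Fahrenheit for both temperature columns
    for col in ['maxtempC', 'mintempC']:
        df[col.replace('C', 'F')] = df[col] * 1.8 + 32
    # Same date-offset binning as in the notebook (include_lowest is an assumption)
    offset = (df['date_time'].dt.month * 100 + df['date_time'].dt.day - 320) % 1300
    df['season'] = pd.cut(offset, [0, 300, 602, 900, 1300],
                          labels=['spring', 'summer', 'autumn', 'winter'],
                          include_lowest=True)
    # Collect the six means directly instead of building and merging two dataframes
    row = {'location': df['location'].iloc[0]}
    for season in ['summer', 'winter']:
        part = df[df['season'] == season]
        for col in ['maxtempF', 'mintempF', 'humidity']:
            row[f'{season}_{col}_mean'] = part[col].mean()
    return pd.DataFrame([row])

# Synthetic stand-ins for the per-city CSV files (constant values, for illustration only)
days = pd.date_range('2010-01-01', '2010-12-31', freq='D')
cities = {}
for name, shift in [('akron', 0), ('example_city', 10)]:
    cities[name] = pd.DataFrame({
        'date_time': days.strftime('%Y-%m-%d'),
        'maxtempC': 20 + shift,
        'mintempC': 10 + shift,
        'humidity': 50,
        'location': name,
    })

# One summary row per city, stacked into a single dataframe
summary = pd.concat([summarize_city(raw) for raw in cities.values()], ignore_index=True)
print(summary)
```

With real data, the `cities` dictionary would be replaced by a loop that calls `pd.read_csv` on each city's file.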

[Click here](https://colab.research.google.com/drive/12t8tEJqOOZTM5cYhfup9r2n9WgeTgByY?usp=sharing) to see part 2