# Working on rental data for a project

[Citrics](https://b.citrics.dev/) is a project that helps people decide on a move to a new city by giving them useful information about different cities. One of its core features is showing rental information and rental trends for different apartment types in each city. This notebook shows how the data was cleaned and wrangled, and how new features were created, so that it can be used to serve rental information and to build a predictive model of future trends.

The data was collected from [huduser.gov](https://www.huduser.gov/portal/datasets/fmr.html). Data for each year was collected separately and then joined together.
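
As a rough sketch of the per-year join described above: assuming each year's cleaned table has the same columns plus a `year` column, the tables can be stacked with `pd.concat`. The helper name `combine_years` and the toy frames are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

def combine_years(frames):
    """Stack per-year rental tables (hypothetical helper) into one long table."""
    # ignore_index=True gives the combined table a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)

# Toy stand-ins for two cleaned yearly tables (values are made up for illustration)
rent_2011 = pd.DataFrame({"city": ["A", "B"], "fmr_0": [500, 600], "year": 2011})
rent_2012 = pd.DataFrame({"city": ["A", "B"], "fmr_0": [510, 620], "year": 2012})

all_years = combine_years([rent_2011, rent_2012])
print(all_years)
```

Keeping `year` as a column means the combined table can later be filtered or grouped by year without tracking which file a row came from.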

In [None]:
import pandas as pd
import numpy as np
#pd.set_option('display.max_rows',None)

In [None]:
#url_cities='./data/100city_population_data_2018.csv'
url= "/content/data/rental_2011.csv"
url_zumper_cities = '/content/data/100city_state_data.csv'
df = pd.read_csv(url)
df_cities = pd.read_csv(url_zumper_cities)


## Area names replaced in the FMR data

| From | To |
|---|---|
| Urban Honolulu | Honolulu |
| Boise City | Boise |
| Winston-Salem | Winston Salem |

Cities added to an existing area name:

- Gilbert, Glendale and Chandler added to the area name for Maricopa County
- Plano, TX added to the area name for Collin County
- Chesapeake added to the area name for Chesapeake city
- Irving, TX added to the area name for Dallas County

## Cities changed in cities_rental

| From | To |
|---|---|
| St Louis | St.Louis |
| St Petersburg | St.Petersburg |

In [None]:
df.shape, df_cities.shape

((4765, 20), (100, 4))

In [None]:
df.columns

Index(['FIPS', 'fmr0', 'fmr1', 'fmr2', 'fmr3', 'fmr4', 'county', 'State',
       'CouSub', 'pop2000', 'countyname', 'Metro_code', 'Areaname',
       'county_town_name', 'ACS_2010_sm_2', 'state_alpha', 'fmr_type', 'metro',
       'FMR_PCT_Change', 'FMR_Dollar_Change'],
      dtype='object')

In [None]:
df.rename(columns = {'Areaname':'areaname', 'fmr0':'fmr_0', 'fmr1':'fmr_1', 'fmr2':'fmr_2', 'fmr3':'fmr_3', 'fmr4':'fmr_4', }, inplace = True) 

In [None]:
df_cities.columns

Index(['city_id', 'city', 'state', 'city_state'], dtype='object')

In [None]:
df.head()

Unnamed: 0,FIPS,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,county,State,CouSub,pop2000,countyname,Metro_code,areaname,county_town_name,ACS_2010_sm_2,state_alpha,fmr_type,metro,FMR_PCT_Change,FMR_Dollar_Change
0,100199999,552,653,735,975,1287,1,1,99999,43671.0,Autauga County,METRO33860M33860,"Montgomery, AL MSA",Autauga County,735.0,AL,40,1,1.0,0.0
1,100399999,534,643,764,1013,1160,3,1,99999,140415.0,Baldwin County,NCNTY01003N01003,"Baldwin County, AL",Baldwin County,764.0,AL,40,0,1.0,0.0
2,100599999,448,449,539,667,687,5,1,99999,29038.0,Barbour County,NCNTY01005N01005,"Barbour County, AL",Barbour County,539.0,AL,40,0,1.0,0.0
3,100799999,634,705,786,997,1027,7,1,99999,20826.0,Bibb County,METRO13820M13820,"Birmingham-Hoover, AL HUD Metro FMR Area",Bibb County,735.0,AL,40,1,1.069388,51.0
4,100999999,634,705,786,997,1027,9,1,99999,51024.0,Blount County,METRO13820M13820,"Birmingham-Hoover, AL HUD Metro FMR Area",Blount County,735.0,AL,40,1,1.069388,51.0


In [None]:
df.loc[df['countyname'] == 'Collin County']

Unnamed: 0,FIPS,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,county,State,CouSub,pop2000,countyname,Metro_code,areaname,county_town_name,ACS_2010_sm_2,state_alpha,fmr_type,metro,FMR_PCT_Change,FMR_Dollar_Change
3863,4808599999,666,738,891,1160,1372,85,48,99999,491675.0,Collin County,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",Collin County,894.0,TX,40,1,0.996644,-3.0


In [None]:
df.loc[df['countyname'] == 'Chesapeake city']

Unnamed: 0,FIPS,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,county,State,CouSub,pop2000,countyname,Metro_code,areaname,county_town_name,ACS_2010_sm_2,state_alpha,fmr_type,metro,FMR_PCT_Change,FMR_Dollar_Change
4459,5155099999,800,834,965,1319,1590,550,51,99999,199184.0,Chesapeake city,METRO47260M47260,"Virginia Beach-Norfolk-Newport News, VA-NC MSA",Chesapeake city,934.0,VA,40,1,1.033191,31.0


In [None]:
df.loc[df['countyname'] == 'Dallas County']

Unnamed: 0,FIPS,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,county,State,CouSub,pop2000,countyname,Metro_code,areaname,county_town_name,ACS_2010_sm_2,state_alpha,fmr_type,metro,FMR_PCT_Change,FMR_Dollar_Change
23,104799999,355,493,547,690,740,47,1,99999,46365.0,Dallas County,NCNTY01047N01047,"Dallas County, AL",Dallas County,547.0,AL,40,0,1.0,0.0
130,503999999,347,481,536,683,869,39,5,99999,9210.0,Dallas County,NCNTY05039N05039,"Dallas County, AR",Dallas County,515.0,AR,40,0,1.040777,21.0
974,1904999999,508,606,739,947,1055,49,19,99999,40750.0,Dallas County,METRO19780M19780,"Des Moines-West Des Moines, IA MSA",Dallas County,737.0,IA,40,1,1.002714,2.0
2527,2905999999,352,457,541,738,762,59,29,99999,15661.0,Dallas County,METRO44180N29059,"Dallas County, MO HUD Metro FMR Area",Dallas County,517.0,MO,40,1,1.046422,24.0
3877,4811399999,666,738,891,1160,1372,113,48,99999,2218899.0,Dallas County,METRO19100M19100,"Dallas, TX HUD Metro FMR Area",Dallas County,894.0,TX,40,1,0.996644,-3.0


In [None]:
pd.set_option('display.max_colwidth', None)
df.loc[df['countyname'] == 'Los Angeles County'].areaname

204    Los Angeles-Long Beach, CA HUD Metro FMR Area
Name: areaname, dtype: object

In [None]:
df.loc[df['countyname'] == 'Maricopa County'].areaname

103    Phoenix-Mesa-Glendale, AZ MSA
Name: areaname, dtype: object

In [None]:
print(sorted(df.areaname.unique()))

['Abbeville County, SC', 'Abilene, TX MSA', 'Acadia Parish, LA', 'Accomack County, VA', 'Adair County, IA', 'Adair County, KY', 'Adair County, MO', 'Adair County, OK', 'Adams County, IA', 'Adams County, ID', 'Adams County, IL', 'Adams County, IN', 'Adams County, MS', 'Adams County, ND', 'Adams County, NE', 'Adams County, OH', 'Adams County, PA', 'Adams County, WA', 'Adams County, WI', 'Addison County, VT', 'Aguadilla-Isabela-San Sebastián, PR MSA', 'Aitkin County, MN', 'Akron, OH MSA', 'Alamosa County, CO', 'Albany County, WY', 'Albany, GA MSA', 'Albany-Schenectady-Troy, NY MSA', 'Albuquerque, NM MSA', 'Alcona County, MI', 'Alcorn County, MS', 'Aleutians East Borough, AK', 'Aleutians West Census Area, AK', 'Alexandria, LA MSA', 'Alfalfa County, OK', 'Alger County, MI', 'Allamakee County, IA', 'Allegan County, MI', 'Allegany County, NY', 'Alleghany County, NC', 'Alleghany County-Clifton Forge city-Covington city, VA HUD Nonmetro FMR Area', 'Allen County, KS', 'Allen County, KY', 'Allen 

In [None]:
df = df.replace(['Boise City', 'Phoenix-Mesa-Glendale, AZ MSA', 'Virginia Beach-Norfolk-Newport News, VA-NC MSA', 'Orange County, CA HUD Metro FMR Area', 'Las Vegas-Paradise, NV MSA'], ['Boise', 'Phoenix-Mesa-Scottsdale-Chandler-Gilbert-Glendale, AZ MSA', 'Chesapeake-Virginia Beach-Norfolk-Newport News, VA-NC MSA', 'Orange County-Santa Ana-Anaheim, CA HUD Metro FMR Area', 'Las Vegas-Henderson-Paradise, NV MSA'])

In [None]:
df.loc[df['countyname'] == 'Dallas County'] = df.loc[df['countyname'] == 'Dallas County'].replace(['Dallas, TX HUD Metro FMR Area'], ['Irving-Dallas, TX HUD Metro FMR Area'])

In [None]:
df.loc[df['countyname'] == 'Collin County'] = df.loc[df['countyname'] == 'Collin County'].replace(['Dallas, TX HUD Metro FMR Area'], ['Plano-Dallas, TX HUD Metro FMR Area'])
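
The two cells above select rows by `countyname` alone. "Dallas County" exists in several states, and the rename only lands on the Texas row because only that row carries the Dallas area name. A slightly more explicit sketch, assuming the same column names, filters on the state as well and writes only the `areaname` column. The helper `rename_area` and the toy frame are hypothetical.

```python
import pandas as pd

def rename_area(df, county, state, new_name):
    """Set areaname for one county in one state (hypothetical helper)."""
    mask = (df["countyname"] == county) & (df["state_alpha"] == state)
    # .loc with a row mask and a column label modifies df in place
    df.loc[mask, "areaname"] = new_name
    return df

# Toy rows mirroring the columns used above
toy = pd.DataFrame({
    "countyname": ["Dallas County", "Dallas County", "Collin County"],
    "state_alpha": ["AL", "TX", "TX"],
    "areaname": ["Dallas County, AL",
                 "Dallas, TX HUD Metro FMR Area",
                 "Dallas, TX HUD Metro FMR Area"],
})
rename_area(toy, "Dallas County", "TX", "Irving-Dallas, TX HUD Metro FMR Area")
rename_area(toy, "Collin County", "TX", "Plano-Dallas, TX HUD Metro FMR Area")
print(toy["areaname"].tolist())
```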

In [None]:
df['metro'].value_counts()

0    2852
1    1913
Name: metro, dtype: int64

In [None]:
metros_orig = df[df['metro'] == 1]

In [None]:
metros_orig.shape

(1913, 20)

In [None]:
metros_orig.areaname.head()

0                           Montgomery, AL MSA
3     Birmingham-Hoover, AL HUD Metro FMR Area
4     Birmingham-Hoover, AL HUD Metro FMR Area
7                      Anniston-Oxford, AL MSA
10       Chilton County, AL HUD Metro FMR Area
Name: areaname, dtype: object

In [None]:
#metros['areaname'] = metros['areaname'].replace("--","-")
#metros['areaname'].replace('Texarkana, TX-Texarkana, AR HUD Metro FMR Area','Texarkana,  AR HUD Metro FMR Area')

In [None]:
metros_orig[metros_orig['areaname']=='Texarkana, TX-Texarkana, AR HUD Metro FMR Area']

Unnamed: 0,FIPS,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,county,State,CouSub,pop2000,countyname,Metro_code,areaname,county_town_name,ACS_2010_sm_2,state_alpha,fmr_type,metro,FMR_PCT_Change,FMR_Dollar_Change


In [None]:
metros_orig['areaname'][metros_orig.Metro_code == 'METRO45500M45500'] = 'Texarkana, AR HUD Metro FMR Area'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(~key, value, inplace=True)
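
The warning above appears because `metros_orig` is a filtered subset of `df` and the assignment is chained (`frame[col][mask] = ...`). One common way to avoid it, shown here on toy data, is to take an explicit `.copy()` of the subset and assign through a single `.loc` call. This is a sketch of the general pattern with placeholder values, not a change to the results shown in this notebook.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Metro_code": ["METRO45500M45500", "METRO33860M33860", "NCNTY01005N01005"],
    "areaname": ["original name (placeholder)",
                 "Montgomery, AL MSA",
                 "Barbour County, AL"],
    "metro": [1, 1, 0],
})

# An explicit copy makes the subset independent of df_toy
metros_toy = df_toy[df_toy["metro"] == 1].copy()

# One .loc call selects rows and column together, so nothing is chained
metros_toy.loc[metros_toy["Metro_code"] == "METRO45500M45500", "areaname"] = (
    "Texarkana, AR HUD Metro FMR Area"
)
print(metros_toy["areaname"].tolist())
```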


In [None]:
metros_orig['areaname'][metros_orig.Metro_code == 'METRO45500M45500']

156     Texarkana, AR HUD Metro FMR Area
3839    Texarkana, AR HUD Metro FMR Area
Name: areaname, dtype: object

In [None]:
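def create_cities(areaname):
    """Split an area name such as 'Birmingham-Hoover, AL HUD Metro FMR Area' into a list of city names."""
    # Treat a double hyphen as a single separator
    areaname = areaname.replace("--", "-")
    # Text before the comma is the hyphen-separated city list; the rest (state and area type) is discarded.
    # This unpacking assumes the name contains exactly one comma.
    cities, _ = areaname.split(",")
    return cities.split("-")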
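
`create_cities` expects exactly one comma in the area name, so a name with two commas (as in the commented-out Texarkana example earlier) would raise an error. As an alternative sketch, splitting on the first comma only tolerates such names. The name `split_cities` is hypothetical and this is not the function used to produce the outputs in this notebook.

```python
def split_cities(areaname):
    """Return the city names that precede the first comma of an area name."""
    areaname = areaname.replace("--", "-")
    # maxsplit=1 keeps everything after the first comma in a single piece
    cities = areaname.split(",", 1)[0]
    return [city.strip() for city in cities.split("-")]

print(split_cities("Birmingham-Hoover, AL HUD Metro FMR Area"))
print(split_cities("Texarkana, TX-Texarkana, AR HUD Metro FMR Area"))
```

Note that either version splits hyphenated city names too, so "Winston-Salem" becomes two entries and needs a follow-up fix.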

In [None]:
metros_orig['cities'] = metros_orig.areaname.apply(create_cities)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [None]:
metros_orig.head()

Unnamed: 0,FIPS,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,county,State,CouSub,pop2000,countyname,Metro_code,areaname,county_town_name,ACS_2010_sm_2,state_alpha,fmr_type,metro,FMR_PCT_Change,FMR_Dollar_Change,cities
0,100199999,552,653,735,975,1287,1,1,99999,43671.0,Autauga County,METRO33860M33860,"Montgomery, AL MSA",Autauga County,735.0,AL,40,1,1.0,0.0,[Montgomery]
3,100799999,634,705,786,997,1027,7,1,99999,20826.0,Bibb County,METRO13820M13820,"Birmingham-Hoover, AL HUD Metro FMR Area",Bibb County,735.0,AL,40,1,1.069388,51.0,"[Birmingham, Hoover]"
4,100999999,634,705,786,997,1027,9,1,99999,51024.0,Blount County,METRO13820M13820,"Birmingham-Hoover, AL HUD Metro FMR Area",Blount County,735.0,AL,40,1,1.069388,51.0,"[Birmingham, Hoover]"
7,101599999,426,471,585,773,909,15,1,99999,112249.0,Calhoun County,METRO11500M11500,"Anniston-Oxford, AL MSA",Calhoun County,585.0,AL,40,1,1.0,0.0,"[Anniston, Oxford]"
10,102199999,398,550,612,769,881,21,1,99999,39593.0,Chilton County,METRO13820N01021,"Chilton County, AL HUD Metro FMR Area",Chilton County,612.0,AL,40,1,1.0,0.0,[Chilton County]


In [None]:
metros_explode = metros_orig.explode('cities')

In [None]:
metros_explode.shape

(3230, 21)
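
The row count grows from 1913 to 3230 because `explode` emits one row per list element and repeats the other columns and the index label for each. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "areaname": ["Montgomery, AL MSA", "Birmingham-Hoover, AL HUD Metro FMR Area"],
    "fmr_0": [552, 634],
    "cities": [["Montgomery"], ["Birmingham", "Hoover"]],
})

# Each list element becomes its own row; index labels are repeated, not renumbered
exploded = toy.explode("cities")
print(exploded[["cities", "fmr_0"]])
```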

In [None]:
metros_explode.columns

Index(['FIPS', 'fmr_0', 'fmr_1', 'fmr_2', 'fmr_3', 'fmr_4', 'county', 'State',
       'CouSub', 'pop2000', 'countyname', 'Metro_code', 'areaname',
       'county_town_name', 'ACS_2010_sm_2', 'state_alpha', 'fmr_type', 'metro',
       'FMR_PCT_Change', 'FMR_Dollar_Change', 'cities'],
      dtype='object')

In [None]:
metros_explode.head().T

Unnamed: 0,0,3,3,4,4
FIPS,100199999,100799999,100799999,100999999,100999999
fmr_0,552,634,634,634,634
fmr_1,653,705,705,705,705
fmr_2,735,786,786,786,786
fmr_3,975,997,997,997,997
fmr_4,1287,1027,1027,1027,1027
county,1,7,7,9,9
State,1,1,1,1,1
CouSub,99999,99999,99999,99999,99999
pop2000,43671,20826,20826,51024,51024


In [None]:
metros_explode[['areaname','cities','state_alpha']].head()

Unnamed: 0,areaname,cities,state_alpha
0,"Montgomery, AL MSA",Montgomery,AL
3,"Birmingham-Hoover, AL HUD Metro FMR Area",Birmingham,AL
3,"Birmingham-Hoover, AL HUD Metro FMR Area",Hoover,AL
4,"Birmingham-Hoover, AL HUD Metro FMR Area",Birmingham,AL
4,"Birmingham-Hoover, AL HUD Metro FMR Area",Hoover,AL


In [None]:
metros_explode['city_state'] = metros_explode['cities']+", "+metros_explode['state_alpha']

In [None]:
#df = metros_explode[['city_state','cities','state_alpha','countyname','fmr_0','fmr_1']]

In [None]:
#df.shape

In [None]:
#df['city_state'].head()

In [None]:
df_cities.head()

Unnamed: 0,city_id,city,state,city_state
0,0,Anchorage,AK,"Anchorage, AK"
1,1,Chandler,AZ,"Chandler, AZ"
2,2,Gilbert,AZ,"Gilbert, AZ"
3,3,Glendale,AZ,"Glendale, AZ"
4,4,Mesa,AZ,"Mesa, AZ"


In [None]:
metros_explode.rename(columns = {'cities':'city'}, inplace = True) 

In [None]:
metros = metros_explode[['city','fmr_0','fmr_1','fmr_2','fmr_3','fmr_4']]

In [None]:
metros.head()

Unnamed: 0,city,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4
0,Montgomery,552,653,735,975,1287
3,Birmingham,634,705,786,997,1027
3,Hoover,634,705,786,997,1027
4,Birmingham,634,705,786,997,1027
4,Hoover,634,705,786,997,1027


In [None]:
metros.shape

(3230, 6)

In [None]:
metros.drop_duplicates(subset=['city'],keep='first',inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [None]:
metros.shape

(704, 6)
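
Dropping duplicates on `city` alone keeps the first row for each name, so two cities that share a name in different states collapse into one row. A possible refinement, sketched on made-up numbers, is to keep the state column and deduplicate on both. Assigning the result instead of using `inplace=True` also sidesteps the copy warning.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Columbus", "Columbus", "Columbus"],
    "state_alpha": ["GA", "GA", "OH"],
    "fmr_0": [500, 500, 550],  # made-up rents
})

# City only: the OH row is lost because a GA row with the same name comes first
by_city = toy.drop_duplicates(subset=["city"], keep="first")

# City and state: one row per distinct (city, state) pair
by_city_state = toy.drop_duplicates(subset=["city", "state_alpha"], keep="first")
print(len(by_city), len(by_city_state))
```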

In [None]:
metros = metros.replace(['Boise City', 'Louisville/Jefferson County', 'Urban Honolulu', 'Winston'], ['Boise', 'Louisville', 'Honolulu', 'Winston Salem'])

In [None]:
metros['year']=2011
metros.head()

Unnamed: 0,city,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,year
0,Montgomery,552,653,735,975,1287,2011
3,Birmingham,634,705,786,997,1027,2011
3,Hoover,634,705,786,997,1027,2011
7,Anniston,426,471,585,773,909,2011
7,Oxford,426,471,585,773,909,2011


In [None]:
merged = df_cities.merge(metros, on="city", how="left")

In [None]:
merged.shape

(100, 10)
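
The merge above joins on `city` only. As a hedged sketch of a stricter variant, the join below uses both city and state, asks pandas to verify that keys are unique, and records which rows found a match. The toy frames and rents are made up, and the renamed `state` column is an assumption about how the two tables would be aligned.

```python
import pandas as pd

cities = pd.DataFrame({"city": ["Columbus", "Boise"], "state": ["OH", "ID"]})
rents = pd.DataFrame({
    "city": ["Columbus", "Columbus", "Boise"],
    "state_alpha": ["GA", "OH", "ID"],
    "fmr_0": [500, 550, 480],  # made-up rents
}).rename(columns={"state_alpha": "state"})

merged_toy = cities.merge(
    rents,
    on=["city", "state"],
    how="left",
    validate="one_to_one",  # raises MergeError if a key repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
print(merged_toy)
```

Rows where `_merge` equals `left_only` are cities with no rental match, which gives the same information as the null check that follows.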

In [None]:
df1 = merged[merged.isna().any(axis=1)]
df1.head(50)

Unnamed: 0,city_id,city,state,city_state,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,year


In [None]:
merged.isnull().sum()

city_id       0
city          0
state         0
city_state    0
fmr_0         0
fmr_1         0
fmr_2         0
fmr_3         0
fmr_4         0
year          0
dtype: int64

In [None]:
merged.head()


Unnamed: 0,city_id,city,state,city_state,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,year
0,0,Anchorage,AK,"Anchorage, AK",726,826,1036,1492,1817,2011
1,1,Chandler,AZ,"Chandler, AZ",666,776,936,1363,1596,2011
2,2,Gilbert,AZ,"Gilbert, AZ",666,776,936,1363,1596,2011
3,3,Glendale,AZ,"Glendale, AZ",666,776,936,1363,1596,2011
4,4,Mesa,AZ,"Mesa, AZ",666,776,936,1363,1596,2011


In [None]:
merged.to_csv("/content/output/rental_data_2011.csv", index=False)

In [None]:
#rental_data = pd.read_csv("./data/rental_data_2020.csv")

In [None]:
#rental_data.head()

In [None]:

dfx = pd.read_csv("/content/data/rental_2020.csv")

dfx.head()

Unnamed: 0,fips2010,fmr_0,fmr_1,fmr_2,fmr_3,fmr_4,state,metro_code,areaname,county,cousub,countyname,county_town_name,pop2017,acs_2019_2,state_alpha,fmr_type,metro,fmr_pct_chg,fmr_dollar_chg
0,100199999,583,702,830,1047,1425,1,METRO33860M33860,"Montgomery, AL MSA",1,99999,Autauga County,Autauga County,55035,825,AL,40,1,0.006061,5
1,100399999,744,749,916,1251,1566,1,METRO19300M19300,"Daphne-Fairhope-Foley, AL MSA",3,99999,Baldwin County,Baldwin County,203360,888,AL,40,1,0.031531,28
2,100599999,477,481,633,789,925,1,NCNTY01005N01005,"Barbour County, AL",5,99999,Barbour County,Barbour County,26200,666,AL,40,0,-0.04955,-33
3,100799999,804,861,986,1291,1425,1,METRO13820M13820,"Birmingham-Hoover, AL HUD Metro FMR Area",7,99999,Bibb County,Bibb County,22580,873,AL,40,1,0.129439,113
4,100999999,804,861,986,1291,1425,1,METRO13820M13820,"Birmingham-Hoover, AL HUD Metro FMR Area",9,99999,Blount County,Blount County,57665,873,AL,40,1,0.129439,113
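
The 2020 file names its columns differently from the raw 2011 file (for example `metro_code` instead of `Metro_code`, and `fmr_0` instead of `fmr0`). A minimal sketch of harmonizing the headers before reusing the 2011 steps follows. The helper `normalize_columns` is hypothetical.

```python
import pandas as pd

def normalize_columns(df):
    """Lower-case column names and spell fmrN as fmr_N (hypothetical helper)."""
    out = df.copy()
    out.columns = (
        out.columns.str.lower()
        .str.replace(r"^fmr(\d)$", r"fmr_\1", regex=True)
    )
    return out

# Empty frame with a few of the raw 2011 headers, just to show the renaming
raw_2011 = pd.DataFrame(columns=["FIPS", "fmr0", "fmr1", "Areaname", "Metro_code", "fmr_type"])
print(list(normalize_columns(raw_2011).columns))
```

Columns whose meaning changed between years, such as `pop2000` and `pop2017`, would still need to be handled by hand.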


In [None]:
print(sorted(dfx.areaname.unique()))

['Abbeville County, SC', 'Abilene, TX MSA', 'Acadia Parish, LA HUD Metro FMR Area', 'Accomack County, VA', 'Adair County, IA', 'Adair County, KY', 'Adair County, MO', 'Adair County, OK', 'Adams County, IA', 'Adams County, ID', 'Adams County, IL', 'Adams County, IN', 'Adams County, MS', 'Adams County, ND', 'Adams County, NE', 'Adams County, OH', 'Adams County, WA', 'Adams County, WI', 'Addison County, VT', 'Aguadilla-Isabela, PR HUD Metro FMR Area', 'Aitkin County, MN', 'Akron, OH MSA', 'Alamosa County, CO', 'Albany County, WY', 'Albany, GA MSA', 'Albany, OR MSA', 'Albany-Schenectady-Troy, NY MSA', 'Albuquerque, NM MSA', 'Alcona County, MI', 'Alcorn County, MS', 'Aleutians East Borough, AK', 'Aleutians West Census Area, AK', 'Alexandria, LA MSA', 'Alfalfa County, OK', 'Alger County, MI', 'Allamakee County, IA', 'Allegan County, MI', 'Allegany County, NY', 'Alleghany County, NC', 'Alleghany County-Clifton Forge city-Covington city, VA HUD Nonmet', 'Allen County, KS', 'Allen County, KY HU

To see the notebook used to create the predictive model, click [here](https://colab.research.google.com/drive/1vbXRInt0RJFJGQzAMc8JNBLhOkSNWIKb?usp=sharing).