## Web Scraping (www.trulia.com): Processing the Data
##### by Sabbir Mohammed

Initializing packages: 

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')

Loading data set:

In [2]:
trulia = pd.read_csv('./data/nyc_trulia_complete_lite.csv')

In [3]:
trulia.head(10)

Unnamed: 0,address,city,soldDate,soldPrice,sqft,zipCode
0,,,,,,
1,245 E 93rd St #26C,New York,21-Dec-18,"$1,399,999","1,056 sqft",10128.0
2,,,,,,
3,15 Broad St #2320,New York,21-Dec-18,"$2,150,000","1,772 sqft",10005.0
4,,,,,,
5,510 W 110th St #12D,New York,21-Dec-18,"$660,000",616 sqft,10025.0
6,,,,,,
7,405 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0
8,,,,,,
9,407 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0


Fixing DataFrame structure:

In [4]:
trulia = trulia.dropna(axis=0, how='all')

In [5]:
trulia = trulia.reset_index()

In [6]:
trulia = trulia.drop(columns='index')

In [7]:
trulia.head()

Unnamed: 0,address,city,soldDate,soldPrice,sqft,zipCode
0,245 E 93rd St #26C,New York,21-Dec-18,"$1,399,999","1,056 sqft",10128.0
1,15 Broad St #2320,New York,21-Dec-18,"$2,150,000","1,772 sqft",10005.0
2,510 W 110th St #12D,New York,21-Dec-18,"$660,000",616 sqft,10025.0
3,405 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0
4,407 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0
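The three structure-fixing cells above can be collapsed into one chained expression. The following is a minimal sketch on a hypothetical toy frame (`raw` and `cleaned` are illustrative names, not part of the notebook), assuming the same pattern of fully empty rows seen in the scraped file:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped file: every other row is entirely empty.
raw = pd.DataFrame({
    'address': [np.nan, '245 E 93rd St #26C', np.nan, '15 Broad St #2320'],
    'city': [np.nan, 'New York', np.nan, 'New York'],
})

# Drop rows where every column is NaN, then renumber the rows from 0.
# drop=True discards the old index instead of adding it as an 'index' column.
cleaned = raw.dropna(axis=0, how='all').reset_index(drop=True)
print(cleaned)
```

Passing `drop=True` makes the separate column-dropping step unnecessary, since the old index is never written back as a column.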


Fixing 'zipCode':

In [8]:
trulia = trulia.astype({'zipCode':int})

In [9]:
trulia = trulia.astype({'zipCode':str})
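The two casts above exist because the ZIP codes are read in as floats (`10128.0`). A small sketch of the same float to int to str chain on a hypothetical sample series (`zips` and `zip_str` are illustrative names):

```python
import pandas as pd

# Hypothetical sample of ZIP codes as they appear after loading (floats).
zips = pd.Series([10128.0, 10005.0, 11229.0])

# float -> int drops the trailing '.0'; int -> str keeps ZIP codes as labels
# rather than numbers that could be summed or averaged by accident.
zip_str = zips.astype(int).astype(str)
print(zip_str.tolist())  # ['10128', '10005', '11229']
```

Note that `astype(int)` raises an error if any value is missing, so this chain assumes every remaining row has a ZIP code.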

Fixing 'sqft':

In [10]:
trulia.sqft = trulia.sqft.str.strip(' sqft')

In [11]:
trulia.sqft = trulia.sqft.str.replace(',','')

In [12]:
trulia = trulia.astype({'sqft': int})
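One caveat on the cells above: `str.strip(' sqft')` treats its argument as a set of characters to trim from both ends, not as a literal suffix. It gives the right result here because the values start and end with digits once the unit is gone, but a literal replacement states the intent more directly. A sketch, using sample values from the table above (`sqft_raw` is an illustrative name):

```python
import pandas as pd

# Sample values copied from the head() output above.
sqft_raw = pd.Series(['1,056 sqft', '616 sqft', '9,775 sqft'])

sqft = (
    sqft_raw
    .str.replace(' sqft', '', regex=False)  # remove the literal unit suffix
    .str.replace(',', '', regex=False)      # remove thousands separators
    .astype(int)
)
print(sqft.tolist())  # [1056, 616, 9775]
```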

Fixing 'soldPrice':

In [13]:
trulia.soldPrice = trulia.soldPrice.str.replace(',','')

In [14]:
trulia.soldPrice = trulia.soldPrice.str.strip('$')

In [15]:
trulia.loc[trulia['soldPrice'] == 'Contact For Estimate', 'soldPrice'] = np.nan


In [16]:
trulia = trulia.astype({'soldPrice': float})
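The price cleanup can also be sketched with `pd.to_numeric`, which converts unparseable entries to NaN in one pass. This is an alternative under the assumption that every non-numeric price should become missing, not only the 'Contact For Estimate' label; `price_raw` is an illustrative name:

```python
import pandas as pd

# Sample values in the same shape as the scraped soldPrice column.
price_raw = pd.Series(['$1,399,999', '$660,000', 'Contact For Estimate'])

price = pd.to_numeric(
    price_raw.str.replace(r'[$,]', '', regex=True),  # strip '$' and ','
    errors='coerce',  # non-numeric text such as 'Contact For Estimate' becomes NaN
)
print(price)
```

The result is a float column, so no separate `astype` step is needed afterwards.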

Adding 'borough':

In [17]:
boro_list = ['New York', 'Manhattan', 'Brooklyn','Bronx','Staten Island']
borough = pd.Series(['Queens' if trulia.city[i] not in boro_list else trulia.city[i] for i in range(len(trulia.city))])

In [18]:
trulia['borough'] = borough

In [19]:
trulia.loc[trulia['city'] == 'New York', 'borough'] = 'Manhattan'


Fixing 'city':

In [20]:
trulia.city = trulia.city.str.replace('Manhattan', 'New York')
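The borough and city steps above can be sketched without an index-based loop. The example below uses a hypothetical sample of city values and keeps the notebook's assumption that any city outside `boro_list` belongs to Queens:

```python
import pandas as pd

# Hypothetical sample of values from the 'city' column.
city = pd.Series(['New York', 'Manhattan', 'Brooklyn', 'Flushing', 'Jamaica'])
boro_list = ['New York', 'Manhattan', 'Brooklyn', 'Bronx', 'Staten Island']

# Keep the city where it is already a borough-level name; otherwise
# fall back to 'Queens' (the notebook's assumption).
borough = city.where(city.isin(boro_list), 'Queens')

# Listings labelled 'New York' are treated as the borough of Manhattan.
borough = borough.replace('New York', 'Manhattan')

# Conversely, normalise the city column so Manhattan listings read 'New York'.
city = city.replace('Manhattan', 'New York')

print(pd.DataFrame({'city': city, 'borough': borough}))
```

`Series.replace` matches whole values only, whereas `str.replace` substitutes substrings, so the two behave differently if a city name merely contains 'Manhattan'.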

Exporting the cleaned data:

In [21]:
trulia.to_csv('./data/trulia_complete.csv')
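By default `to_csv` also writes the row index as an unnamed first column, and a later `read_csv` infers all-digit ZIP codes as integers again. A round-trip sketch on a hypothetical two-row frame, using an in-memory buffer in place of a file and assuming the row index does not need to be preserved:

```python
import io
import pandas as pd

# Hypothetical cleaned frame with zipCode already stored as strings.
df = pd.DataFrame({'zipCode': ['10128', '11229'], 'sqft': [1056, 990]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False skips the row index column
buffer.seek(0)

# Without the dtype argument, read_csv would infer zipCode as integers.
reloaded = pd.read_csv(buffer, dtype={'zipCode': str})
print(reloaded)
```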

Checking the result:

In [22]:
trulia.sample(10)

Unnamed: 0,address,city,soldDate,soldPrice,sqft,zipCode,borough
11663,81 Celeste Ct,Brooklyn,19-Dec-18,639000.0,990,11229,Brooklyn
8750,975 Montgomery St,Brooklyn,6-Jun-18,1000000.0,3280,11213,Brooklyn
1358,95 Saint Marys Ave,Staten Island,14-Sep-18,223080.0,847,10305,Staten Island
2109,66 Bianca Ct,Staten Island,17-Dec-18,412000.0,1510,10312,Staten Island
3825,5707 E Hampton Blvd,Flushing,7-Aug-18,995000.0,1240,11364,Queens
4465,10011 67th Rd #122,Forest Hills,19-Sep-18,296000.0,137120,11375,Queens
10842,1116 Brooklyn Ave,Brooklyn,26-Oct-18,525000.0,1782,11203,Brooklyn
9695,17 Dewey Pl,Brooklyn,15-Aug-18,895000.0,1625,11233,Brooklyn
104,330 E 33rd St #14F,New York,4-Jan-19,585000.0,485,10016,Manhattan
3679,10505 91st St,Jamaica,26-Jul-18,725000.0,1760,11417,Queens


In [23]:
type(trulia.zipCode[10])

str