## Web Scraping (www.trulia.com): Processing the Data
##### by Sabbir Mohammed

Importing packages:

In [23]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')

Loading data set:

In [24]:
trulia = pd.read_csv('nyc_trulia_complete_lite.csv')

In [25]:
trulia.head(10)

Unnamed: 0,address,city,soldDate,soldPrice,sqft,zipCode
0,,,,,,
1,245 E 93rd St #26C,New York,21-Dec-18,"$1,399,999","1,056 sqft",10128.0
2,,,,,,
3,15 Broad St #2320,New York,21-Dec-18,"$2,150,000","1,772 sqft",10005.0
4,,,,,,
5,510 W 110th St #12D,New York,21-Dec-18,"$660,000",616 sqft,10025.0
6,,,,,,
7,405 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0
8,,,,,,
9,407 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0


Fixing DataFrame structure:

In [26]:
trulia = trulia.dropna(axis=0, how='all')

In [27]:
trulia = trulia.reset_index()

In [28]:
trulia = trulia.drop('index', axis=1)

In [29]:
trulia.head()

Unnamed: 0,address,city,soldDate,soldPrice,sqft,zipCode
0,245 E 93rd St #26C,New York,21-Dec-18,"$1,399,999","1,056 sqft",10128.0
1,15 Broad St #2320,New York,21-Dec-18,"$2,150,000","1,772 sqft",10005.0
2,510 W 110th St #12D,New York,21-Dec-18,"$660,000",616 sqft,10025.0
3,405 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0
4,407 W 50th St,New York,21-Dec-18,Contact For Estimate,"9,775 sqft",10019.0
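The structure fix above (drop the fully empty rows, reset the index, drop the leftover `index` column) can also be written as one chained expression. The following is a minimal sketch on a made-up two-column frame, not the scraped file; `raw` and `clean` are hypothetical names. `reset_index(drop=True)` discards the old index rather than inserting it as a column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) mimicking the scraped layout:
# every other row is completely empty.
raw = pd.DataFrame({
    'address': [np.nan, '245 E 93rd St #26C', np.nan, '15 Broad St #2320'],
    'city': [np.nan, 'New York', np.nan, 'New York'],
})

# Drop rows where every column is NaN, then renumber from 0
# without keeping the old index as a column.
clean = raw.dropna(axis=0, how='all').reset_index(drop=True)
print(clean)
```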


Fixing 'zipCode':

In [30]:
trulia = trulia.astype({'zipCode':int})

In [31]:
trulia = trulia.astype({'zipCode':str})

Fixing 'sqft':

In [32]:
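# Remove the literal ' sqft' suffix (str.strip would treat ' sqft' as a set of characters)
trulia.sqft = trulia.sqft.str.replace(' sqft', '', regex=False)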

In [33]:
trulia.sqft = trulia.sqft.str.replace(',','')

In [34]:
trulia = trulia.astype({'sqft': int})
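The `sqft` cleanup can be checked in isolation on a few sample strings. This is only a sketch: the values are copied from the rows shown earlier, and `sqft` and `sqft_clean` are hypothetical names.

```python
import pandas as pd

# Sample values in the same format as the scraped 'sqft' column.
sqft = pd.Series(['1,056 sqft', '616 sqft', '9,775 sqft'])

# Note: str.strip(' sqft') would remove any of the characters
# ' ', 's', 'q', 'f', 't' from both ends, not the literal suffix.
# A literal replacement states the intent directly.
sqft_clean = (
    sqft.str.replace(' sqft', '', regex=False)  # drop the unit
        .str.replace(',', '', regex=False)      # drop thousands separators
        .astype(int)
)
print(sqft_clean.tolist())
```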

Fixing 'soldPrice':

In [35]:
trulia.soldPrice = trulia.soldPrice.str.replace(',','')

In [36]:
trulia.soldPrice = trulia.soldPrice.str.strip('$')

In [37]:
trulia.loc[trulia['soldPrice'] == 'Contact For Estimate', 'soldPrice'] = np.nan


In [38]:
trulia = trulia.astype({'soldPrice': float})
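An alternative to replacing the 'Contact For Estimate' placeholder by hand is to let `pd.to_numeric` coerce anything non-numeric to `NaN`. This is a sketch of that swapped-in technique on sample values, not what the notebook itself does; `prices`, `stripped`, and `price_float` are hypothetical names.

```python
import pandas as pd

# Sample values in the same format as the scraped 'soldPrice' column.
prices = pd.Series(['$1,399,999', '$660,000', 'Contact For Estimate'])

# Remove thousands separators and the leading dollar sign.
stripped = prices.str.replace(',', '', regex=False).str.lstrip('$')

# errors='coerce' turns every value that cannot be parsed as a number into NaN.
price_float = pd.to_numeric(stripped, errors='coerce')
print(price_float)
```

One caveat of this approach is that it silently turns any unexpected string into `NaN`, so it may be worth counting the missing values before and after.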

Adding 'borough':

In [39]:
boro_list = ['New York', 'Manhattan', 'Brooklyn','Bronx','Staten Island']
borough = pd.Series(['Queens' if trulia.city[i] not in boro_list else trulia.city[i] for i in range(len(trulia.city))])

In [40]:
trulia['borough'] = borough

In [41]:
trulia.loc[trulia['city'] == 'New York', 'borough'] = 'Manhattan'


Fixing 'city':

In [42]:
trulia.city = trulia.city.str.replace('Manhattan', 'New York')
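The borough assignment can also be expressed without a Python-level loop. The sketch below uses a made-up `city` series (the neighborhood names are illustrative) and keeps the notebook's own assumption that any city label outside `boro_list` belongs to Queens.

```python
import pandas as pd

# Hypothetical city labels; 'Astoria' and 'Flushing' stand in for
# neighborhood names that are not in the borough list.
city = pd.Series(['New York', 'Manhattan', 'Brooklyn', 'Astoria', 'Flushing', 'Bronx'])
boro_list = ['New York', 'Manhattan', 'Brooklyn', 'Bronx', 'Staten Island']

# Keep the label where it is in boro_list; otherwise assume Queens.
borough = city.where(city.isin(boro_list), 'Queens')

# As in the notebook, a city of 'New York' is mapped to the borough 'Manhattan'.
borough = borough.replace('New York', 'Manhattan')
print(borough.tolist())
```

Because `where` preserves the original index, the result aligns with the source column even if the index has not been reset.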

Exporting:

In [22]:
trulia.to_csv('trulia_complete.csv')
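By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again, and `read_csv` will parse zip codes as numbers unless told otherwise. The following sketch shows a round trip that avoids both, using an in-memory buffer instead of a file; the frame and variable names are hypothetical.

```python
import io
import pandas as pd

# Small stand-in for the cleaned frame.
df = pd.DataFrame({'address': ['15 Broad St #2320'], 'zipCode': ['10005']})

# index=False leaves the row index out of the CSV.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# dtype keeps zipCode as a string instead of letting it be parsed as an integer.
back = pd.read_csv(buf, dtype={'zipCode': str})
print(back)
```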

Checking the result:

In [43]:
trulia.head()

Unnamed: 0,address,city,soldDate,soldPrice,sqft,zipCode,borough
0,245 E 93rd St #26C,New York,21-Dec-18,1399999.0,1056,10128,Manhattan
1,15 Broad St #2320,New York,21-Dec-18,2150000.0,1772,10005,Manhattan
2,510 W 110th St #12D,New York,21-Dec-18,660000.0,616,10025,Manhattan
3,405 W 50th St,New York,21-Dec-18,,9775,10019,Manhattan
4,407 W 50th St,New York,21-Dec-18,,9775,10019,Manhattan


In [44]:
type(trulia.zipCode[10])

str