# EDA


- The `os` module has a method for listing the files in a directory (`os.listdir`).
- `pandas.json_normalize` could work here, but it is not needed to convert the JSON data to a dataframe.
- You may need a nested for-loop to access each sale!
- We've put a lot of time into the structure of this repository, and it's a good template for future projects. The file functions_variables.py contains an example function that you can import and use. Any variables, functions or classes that you want to create can go in functions_variables.py and be imported into a notebook. Note that only .py files can be imported into a notebook. To import everything from a .py file, use the following:
```python
from functions_variables import *
```
If you use `import functions_variables` instead, each object from the file must be prefixed with `functions_variables.` (for example, `functions_variables.encode_tags`).\
Using this .py file keeps your notebooks organized and makes it easier to reuse code between notebooks.

In [1]:
# (this is not an exhaustive list of libraries)
import pandas as pd
import numpy as np
import os
import json
from pprint import pprint
from functions_variables import encode_tags
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import functions_variables as fv

## Data Importing

In [2]:
# load one file first to see what type of data you're dealing with and what attributes it has
# os.path.join keeps the path portable; a backslash path like '..\data' only resolves on Windows
with open(os.path.join('..', 'data', 'AK_Juneau_4.json'), 'r') as f:
    data = json.load(f)
# show the data
pprint(data)
# create a dataframe to see what the information is
testing_df = pd.DataFrame(data['data']['results'])


{'data': {'count': 4,
          'results': [{'branding': [{'name': None,
                                     'photo': None,
                                     'type': 'Office'}],
                       'community': None,
                       'description': {'baths': None,
                                       'baths_1qtr': None,
                                       'baths_3qtr': None,
                                       'baths_full': None,
                                       'baths_half': None,
                                       'beds': None,
                                       'garage': None,
                                       'lot_sqft': None,
                                       'name': None,
                                       'sold_date': '2023-08-21',
                                       'sold_price': None,
                                       'sqft': None,
                                       'stories': None,
                                  

 - Looking through the file, all the information we need is inside `data['data']['results']`.
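
As a rough illustration of the `json_normalize` option mentioned at the top of the notebook, the sketch below flattens a made-up miniature of `data['data']['results']`. The records and their values are hypothetical; only the nesting pattern (a `description` dict and a `location` dict containing `address`) mirrors the real files.

```python
import pandas as pd

# Hypothetical miniature of the real payload; all values are made up
sample = {
    "data": {
        "count": 2,
        "results": [
            {
                "property_id": "A1",
                "description": {"beds": 3, "sold_price": 100000},
                "location": {"address": {"city": "Juneau", "postal_code": "99801"}},
            },
            {
                "property_id": "B2",
                "description": {"beds": None, "sold_price": None},
                "location": {"address": {"city": "Juneau", "postal_code": "99801"}},
            },
        ],
    }
}

# json_normalize expands nested dicts into columns whose names are joined by `sep`
flat = pd.json_normalize(sample["data"]["results"], sep="_")
print(sorted(flat.columns))
```

Columns that hold lists (such as `tags` in the real data) are left as lists by this call, so they would still need separate handling.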

In [3]:
# loop over all files and put them into a dataframe
# create the variables we will use to loop over the data
folder_name = os.path.join('..', 'data')  # portable path; '..\data' only works on Windows
filenames = os.listdir(folder_name)
df = pd.DataFrame()
empty_files = []
# iterate through every data file we have
for file in filenames:
    #ensure files are "json" files
    if file.endswith(".json"):
        file_path = os.path.join(folder_name, file)

        with open(file_path, 'r') as f:
            try:
                data = json.load(f)
                #create a small dataframe which we will add onto the large one
                small_df = pd.DataFrame(data['data']['results'])
                # print(file, "Loaded Successfully") - for testing purposes
                #add the new data to the bottom of our dataframe
                if small_df.empty:
                    # print("file is empty:", file)
                    empty_files.append(file)
                else:
                    df = pd.concat([df, small_df], ignore_index = True)
            except json.JSONDecodeError as e:
                #print if there was an error
                print("Error Decoding file:", e, file)
    else:
        #print out any files that are not part of it
        print("Not a Json:", file)
            
df.head()


Not a Json: .gitkeep


Not a Json: processed


Unnamed: 0,primary_photo,last_update_date,source,tags,permalink,status,list_date,open_houses,description,branding,...,photos,flags,community,products,virtual_tours,other_listings,listing_id,price_reduced_amount,location,matterport
0,{'href': 'https://ap.rdcpix.com/07097d34c98a59...,2023-09-19T20:52:50Z,"{'plan_id': None, 'agents': [{'office_name': '...","[carport, community_outdoor_space, cul_de_sac,...",9453-Herbert-Pl_Juneau_AK_99801_M90744-30767,sold,2023-06-29T21:16:25.000000Z,,"{'year_built': 1963, 'baths_3qtr': None, 'sold...","[{'name': 'EXP Realty LLC - Southeast Alaska',...",...,"[{'tags': [{'label': 'house_view', 'probabilit...","{'is_new_construction': None, 'is_for_rent': N...",,{'brand_name': 'basic_opt_in'},,"{'rdc': [{'listing_id': '2957241843', 'listing...",2957241843.0,45000.0,"{'address': {'postal_code': '99801', 'state': ...",False
1,,,,,8477-Thunder-Mountain-Rd_Juneau_AK_99801_M9424...,sold,,,"{'year_built': None, 'baths_3qtr': None, 'sold...","[{'name': None, 'photo': None, 'type': 'Office'}]",...,,"{'is_new_construction': None, 'is_for_rent': N...",,,,"{'rdc': [{'listing_id': '2958935271', 'listing...",,,"{'address': {'postal_code': '99801', 'state': ...",False
2,,,,,4515-Glacier-Hwy_Juneau_AK_99801_M94790-68516,sold,,,"{'year_built': None, 'baths_3qtr': None, 'sold...","[{'name': None, 'photo': None, 'type': 'Office'}]",...,,"{'is_new_construction': None, 'is_for_rent': N...",,,,"{'rdc': [{'listing_id': '2958935192', 'listing...",,,"{'address': {'postal_code': '99801', 'state': ...",False
3,,,,,17850-Point-Stephens-Rd_Juneau_AK_99801_M98793...,sold,,,"{'year_built': None, 'baths_3qtr': None, 'sold...","[{'name': None, 'photo': None, 'type': 'Office'}]",...,,"{'is_new_construction': None, 'is_for_rent': N...",,,,"{'rdc': [{'listing_id': '2958925235', 'listing...",,,"{'address': {'postal_code': '99801', 'state': ...",False
4,,,,,9951-Stephen-Richards-Memorial-Dr_Juneau_AK_99...,sold,,,"{'year_built': None, 'baths_3qtr': None, 'sold...","[{'name': None, 'photo': None, 'type': 'Office'}]",...,,"{'is_new_construction': None, 'is_for_rent': N...",,,,"{'rdc': [{'listing_id': '2958924367', 'listing...",,,"{'address': {'postal_code': '99801', 'state': ...",False


In [4]:
empty_files
# I manually checked a sample of these files to confirm they really are empty and that this wasn't a coding error on my part.

['HI_Honolulu_3.json',
 'HI_Honolulu_4.json',
 'ME_Augusta_0.json',
 'ME_Augusta_1.json',
 'ME_Augusta_2.json',
 'ME_Augusta_3.json',
 'ME_Augusta_4.json',
 'MS_Jackson_0.json',
 'MS_Jackson_1.json',
 'MS_Jackson_2.json',
 'MS_Jackson_3.json',
 'MS_Jackson_4.json',
 'ND_Bismarck_2.json',
 'ND_Bismarck_3.json',
 'ND_Bismarck_4.json',
 'NH_Concord_3.json',
 'NH_Concord_4.json',
 'SD_Pierre_0.json',
 'SD_Pierre_1.json',
 'SD_Pierre_2.json',
 'SD_Pierre_3.json',
 'SD_Pierre_4.json',
 'VT_Montpelier_0.json',
 'VT_Montpelier_1.json',
 'VT_Montpelier_2.json',
 'VT_Montpelier_3.json',
 'VT_Montpelier_4.json',
 'WY_Cheyenne_0.json',
 'WY_Cheyenne_1.json',
 'WY_Cheyenne_2.json',
 'WY_Cheyenne_3.json',
 'WY_Cheyenne_4.json']
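
One alternative to growing `df` with `pd.concat` inside the loop is to collect the per-file dataframes in a list and concatenate once at the end. The sketch below is only an illustration of that pattern: `load_results` is a hypothetical helper name, and the demo files are made up and written to a temporary folder.

```python
import json
import os
import tempfile

import pandas as pd


def load_results(folder):
    """Read every .json file in `folder`; return (combined_df, empty_files)."""
    frames, empty_files = [], []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue  # skip things like .gitkeep or subfolders
        with open(os.path.join(folder, name), "r") as f:
            try:
                payload = json.load(f)
            except json.JSONDecodeError as e:
                print("Error decoding file:", e, name)
                continue
        results = payload["data"]["results"]
        if results:
            frames.append(pd.DataFrame(results))
        else:
            empty_files.append(name)
    # a single concat at the end avoids concatenating onto an empty dataframe
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, empty_files


# Usage on a throwaway folder containing made-up files
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.json"), "w") as f:
        json.dump({"data": {"results": [{"x": 1}, {"x": 2}]}}, f)
    with open(os.path.join(tmp, "b.json"), "w") as f:
        json.dump({"data": {"results": []}}, f)
    open(os.path.join(tmp, ".gitkeep"), "w").close()
    combined, empty_files = load_results(tmp)

print(len(combined), empty_files)
```

Concatenating once is also cheaper than repeated concatenation, since each `pd.concat` call copies the data accumulated so far.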

## Data Cleaning and Wrangling

At this point, ensure that you have all sales in a dataframe.
- Take a quick look at your data (i.e. `.info()`, `.describe()`) - what do you see?
- Is each cell one value, or do some cells have lists?
- What are the data types of each column?
- Some sales may not actually include the sale price (target).  These rows should be dropped.
- There are a lot of NA/None values.  Should these be dropped or replaced with something?
    - You can drop rows or use various methods to fill NAs - use your best judgement for each column
    - e.g. for some columns (like Garage), NA probably just means no garage, so 0
- Drop columns that aren't needed
    - Don't keep the list price because it will be too close to the sale price. Assume we want to predict the price of houses not yet listed
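
To answer the "one value or lists?" and "how many NAs?" questions quickly, something like the following sketch may help. The `toy` frame and the `nested_columns` helper are hypothetical stand-ins, not part of the project code.

```python
import pandas as pd

# Toy frame standing in for the combined sales data (values are made up)
toy = pd.DataFrame({
    "tags": [["garage", "view"], None, ["pool"]],
    "description": [{"beds": 3}, {"beds": None}, {"beds": 2}],
    "list_price": [100000.0, None, 250000.0],
})


def nested_columns(frame):
    """Return the names of columns holding at least one list or dict."""
    return [
        col for col in frame.columns
        if frame[col].map(lambda v: isinstance(v, (list, dict))).any()
    ]


nested = nested_columns(toy)   # columns that need unpacking
missing = toy.isna().sum()     # missing values per column
print(nested)
print(missing)
```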

In [5]:
# load and concatenate data here
# drop or replace values as necessary


In [6]:
df.info() # looks like we only actually want the information in "location", "tags", "property_id" and "description"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8159 entries, 0 to 8158
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   primary_photo         7403 non-null   object 
 1   last_update_date      8125 non-null   object 
 2   source                7752 non-null   object 
 3   tags                  7638 non-null   object 
 4   permalink             8159 non-null   object 
 5   status                8159 non-null   object 
 6   list_date             7752 non-null   object 
 7   open_houses           0 non-null      object 
 8   description           8159 non-null   object 
 9   branding              8159 non-null   object 
 10  list_price            7721 non-null   float64
 11  lead_attributes       8159 non-null   object 
 12  property_id           8159 non-null   object 
 13  photos                7403 non-null   object 
 14  flags                 8159 non-null   object 
 15  community            

In [7]:
df.describe()

Unnamed: 0,list_price,price_reduced_amount
count,7721.0,2484.0
mean,434158.2,24427.04
std,551492.5,71623.96
min,1.0,100.0
25%,209000.0,6000.0
50%,325000.0,10100.0
75%,499900.0,20000.0
max,12500000.0,2015999.0


In [8]:
# Quick function to break a column of dicts down into its separate parts
def break_it_down(column):
    # each cell is a dict: its keys become columns and the original row index is kept
    col_dict = column.to_dict()
    col_df = pd.DataFrame(col_dict).transpose()
    return col_df

In [9]:
#lets make a full list of columns we want to get rid of:
print("columns before drop:", df.columns)
columns_to_drop = [
    'primary_photo',
    'last_update_date',
    'source',
    'permalink',
    'status',
    'list_date',
    'open_houses',
    'branding',
    'list_price',
    'lead_attributes',
    'photos',
    'virtual_tours',
    'other_listings',
    'listing_id',
    'price_reduced_amount',
    'matterport',
    'sold_date',
    'products',
    'street_view_url',
    'community',
    'county',
    'line',
    'tags',
    'flags',
    'name',
    'baths_1qtr',
    'sub_type',
    'baths_full',
    'baths_half',
    'baths_3qtr',
    'state_code',
]

# the description column is nested, so let's pull it out and put it back in as separate columns

desc_df = break_it_down(df['description'])
columns_to_drop.append('description')
df[desc_df.columns] = desc_df
# let's do the same with the "location" column
loc_df = break_it_down(df['location'])
columns_to_drop.append('location')
df[loc_df.columns] = loc_df
# unfortunately "address" is still nested, so break it down as well
address_df = break_it_down(loc_df['address'])
df[address_df.columns] = address_df
columns_to_drop.append('address')
# break down the coordinates so we have them as well
coordinate_df = break_it_down(address_df['coordinate'])
columns_to_drop.append('coordinate')
df[coordinate_df.columns] = coordinate_df
# get rid of every row without a sold price
df = df.dropna(subset=['sold_price'])
# create a separate dataframe called "df_tags"; we will come back to it later and add those details back in
df_tags = df[['property_id', 'tags']]
df = df.drop(columns=columns_to_drop)

print(f"columns after drops: {df.columns}")
df = df.drop_duplicates()
df



columns before drop: Index(['primary_photo', 'last_update_date', 'source', 'tags', 'permalink',
       'status', 'list_date', 'open_houses', 'description', 'branding',
       'list_price', 'lead_attributes', 'property_id', 'photos', 'flags',
       'community', 'products', 'virtual_tours', 'other_listings',
       'listing_id', 'price_reduced_amount', 'location', 'matterport'],
      dtype='object')
columns after drops: Index(['property_id', 'year_built', 'sold_price', 'lot_sqft', 'sqft', 'baths',
       'garage', 'stories', 'beds', 'type', 'postal_code', 'state', 'city',
       'lon', 'lat'],
      dtype='object')


Unnamed: 0,property_id,year_built,sold_price,lot_sqft,sqft,baths,garage,stories,beds,type,postal_code,state,city,lon,lat
30,8846541030,1998,129900,11761,1478,2,2,1,3,single_family,36117,Alabama,Montgomery,-86.178412,32.389075
31,7727981021,1945,88500,6534,1389,2,1,2,4,single_family,36107,Alabama,Montgomery,-86.273286,32.382748
32,7320925131,1969,145000,17424,2058,2,,1,3,single_family,36109,Alabama,Montgomery,-86.221454,32.380023
33,7231604965,1955,65000,9712,1432,2,,1,3,single_family,36107,Alabama,Montgomery,-86.284387,32.386844
34,7700690979,1984,169000,10890,1804,2,,1,3,single_family,36106,Alabama,Montgomery,-86.232662,32.351898
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7993,4542127284,1910,99000,4792,1214,1,1,2,3,single_family,25314,West Virginia,Charleston,-81.644994,38.341576
7994,3895826397,,29700,7841,988,1,,,3,single_family,25387,West Virginia,Charleston,-81.661662,38.377371
7995,4941005485,,162250,65340,1470,1,,,3,single_family,25314,West Virginia,Charleston,-81.659885,38.338617
7996,4306867390,,63800,,,0,,,0,single_family,25302,West Virginia,Charleston,-81.644214,38.363038


In [10]:
# let's fill some NaN values, starting with garage, where we can assume a missing value means 0
df['garage'] = df['garage'].fillna(0)

# take the average latitude and longitude for each city
lon_lat_dict = df[['city', 'lon', 'lat']].groupby('city').mean().transpose().to_dict()
# fill those averages back into the dataframe for each city
# function to replace the longitude and latitude with the average supplied
# def city_replacement(dict, df):
#     for city, coords in dict.items():
#         df.loc[(df['city'] == city) & (df['lon'].isna()), 'lon'] = coords['lon']
#         df.loc[(df['city'] == city) & (df['lat'].isna()), 'lat'] = coords['lat']
#     return df
df = fv.item_replacement(lon_lat_dict, df, 'city', 'lon', 'lat')
# there are still a few missing values, so I used Google to fill in the last three manually
missing_dict = {'Boone': {'lon': -93.885490, 'lat': 42.060650},
                'Garnett': {'lon': -81.2454, 'lat': 32.6063},  # western longitude, so it must be negative
                'Charlton Heights': {'lon': -81.24385, 'lat': 38.13673}}
df = fv.item_replacement(missing_dict, df, 'city', 'lon', 'lat')
# all the rows with a missing city (there are 5) are from the same place in Columbus, Ohio
df['city'] = df['city'].fillna('Columbus')
# let's fix some of the "type" column: judging by the other values in those rows, "other" means land,
# and "condos" will be adjusted to "condo"
# we will fill the NaN cells with 'land' as well
type_mapping = {'other': 'land',
                'condos': 'condo',
                }
df['type'] = df['type'].replace(type_mapping)
df['type'] = df['type'].fillna('land')

# set missing 'year_built', 'sqft', 'baths', 'stories' and 'beds' values to 0 for the 'land' types
to_change_list = ['year_built', 'sqft', 'baths', 'stories', 'beds']
for col in to_change_list:
    df.loc[(df['type'] == 'land') & (df[col].isna()), col] = 0
# fill the remaining missing year_built values with the column mean, for lack of a better method
# (note: the 0s just assigned to 'land' rows are included in this mean and pull it down)
mean_year = df['year_built'].mean()
df['year_built'] = df['year_built'].fillna(int(mean_year))

# take the average number of beds and baths for each 'type' to fill in the missing values
bed_bath_dict = df[['type', 'beds', 'baths']].groupby('type').mean().astype(int).transpose().to_dict()

df = fv.item_replacement(bed_bath_dict, df, 'type', 'beds','baths')
# fill the missing stories with 0
df['stories'] = df['stories'].fillna(0)
# drop the two types 'condo_townhome_rowhome_coop' and 'duplex_triplex', as there is only a single row of each
df = df[df['type'] != 'condo_townhome_rowhome_coop']
df = df[df['type'] != 'duplex_triplex']
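
The body of `fv.item_replacement` is not shown in this notebook; the commented-out `city_replacement` above suggests it fills missing values with a per-group average. As a hedged alternative sketch of the same idea (not the helper itself), pandas can do a group-mean fill directly with `groupby(...).transform`. The `toy` frame below is made up.

```python
import numpy as np
import pandas as pd

# Made-up data: two cities, each with one row missing its coordinates
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B"],
    "lon": [-86.0, -86.2, np.nan, -81.5, np.nan],
    "lat": [32.0, 32.4, np.nan, 38.0, np.nan],
})

for col in ["lon", "lat"]:
    # transform("mean") returns a series aligned with `toy`, holding each row's city mean
    city_mean = toy.groupby("city")[col].transform("mean")
    toy[col] = toy[col].fillna(city_mean)

print(toy)
```

A city whose rows are all missing would stay NaN under this approach, which matches the need for the manual lookups above.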




In [11]:
# do the same with sqft and lot_sqft, taking the averages for each type within each city
sqfts_dict = df[['city', 'type', 'sqft', 'lot_sqft']].groupby(['type','city']).mean().transpose().to_dict()

df = fv.sqfts_replacement(sqfts_dict, df)

# repeat the process using the averages by type alone, as not every city had several different types to average over

sqfts_dict_2 = df[['type', 'sqft', 'lot_sqft']].groupby('type').mean().transpose().to_dict()

df = fv.item_replacement(sqfts_dict_2, df, 'type', 'sqft', 'lot_sqft')

# assign datatypes to the columns: make a list of columns for each dtype and cast them
int_columns = ['year_built', 'sqft', 'lot_sqft', 'baths','garage','stories','beds', 'postal_code']
df[int_columns] = df[int_columns].astype(int)
category_columns = ['type', 'city', 'state']
df[category_columns] = df[category_columns].astype('category')
float_columns = ['lon', 'lat']
df[float_columns] = df[float_columns].astype('float64')
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1473 entries, 30 to 8038
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   property_id  1473 non-null   object  
 1   year_built   1473 non-null   int32   
 2   sold_price   1473 non-null   object  
 3   lot_sqft     1473 non-null   int32   
 4   sqft         1473 non-null   int32   
 5   baths        1473 non-null   int32   
 6   garage       1473 non-null   int32   
 7   stories      1473 non-null   int32   
 8   beds         1473 non-null   int32   
 9   type         1473 non-null   category
 10  postal_code  1473 non-null   int32   
 11  state        1473 non-null   category
 12  city         1473 non-null   category
 13  lon          1473 non-null   float64 
 14  lat          1473 non-null   float64 
dtypes: category(3), float64(2), int32(8), object(2)
memory usage: 112.3+ KB


### Dealing with Tags

Consider the fact that with tags, there are a lot of categorical variables.
- How many columns would we have if we one-hot encode (OHE) tags, city and state?
- Perhaps we can get rid of tags that have a low frequency.

In [12]:
df_tags = encode_tags(df_tags)
df = pd.merge(left=df, right=df_tags, on='property_id', how='left')
df

Unnamed: 0,property_id,year_built,sold_price,lot_sqft,sqft,baths,garage,stories,beds,type,...,tags_volleyball,tags_washer_dryer,tags_water_view,tags_waterfront,tags_well_water,tags_white_kitchen,tags_wine_cellar,tags_wooded_land,tags_wrap_around_porch,tags_nan
0,8846541030,1998,129900,11761,1478,2,2,1,3,single_family,...,0,0,0,0,0,0,0,0,0,0
1,7727981021,1945,88500,6534,1389,2,1,2,4,single_family,...,0,0,0,0,0,0,0,0,0,0
2,7320925131,1969,145000,17424,2058,2,0,1,3,single_family,...,0,0,0,0,0,0,0,0,0,0
3,7231604965,1955,65000,9712,1432,2,0,1,3,single_family,...,0,0,0,0,0,0,0,0,0,0
4,7700690979,1984,169000,10890,1804,2,0,1,3,single_family,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1468,4542127284,1910,99000,4792,1214,1,1,2,3,single_family,...,0,0,0,0,0,0,0,0,0,0
1469,3895826397,1858,29700,7841,988,1,0,0,3,single_family,...,0,0,0,0,0,0,0,0,0,0
1470,4941005485,1858,162250,65340,1470,1,0,0,3,single_family,...,0,0,0,0,0,0,0,0,0,0
1471,4306867390,1858,63800,16592,1566,0,0,0,0,single_family,...,0,0,0,0,0,0,0,0,0,0


### Dealing with Cities

- Sales will vary drastically between cities and states.  Is there a way to keep information about which city it is without OHE?
- Could we label encode or ordinal encode?  Yes, but this may have undesirable effects, giving nominal data ordinal values.
- What we can do is use our training data to encode the mean sale price by city as a feature (a.k.a. Target Encoding)
    - We can do this as long as we ONLY use the training data - we're using the available data to give us a 'starting guess' of the price for each city, without needing to encode city explicitly
- If you replace cities or states with numerical values (like the mean price), make sure that the data is split so that we don't leak data into the training selection. This is a great time to train test split. Compute on the training data, and join these values to the test data
- Note that you *may* have cities in the test set that are not in the training set. You don't want these to be NA, so maybe you can fill them with the overall mean

In [13]:
# perform train test split here
X = df.drop('sold_price', axis=1)
y = df['sold_price']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.2, random_state=42)
# do something with state and city
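
The target-encoding idea described in this section could be sketched roughly as follows. `add_city_mean_price` is a hypothetical helper and the data is made up; the key points are that the city means come from the training rows only and that unseen cities fall back to the overall training mean.

```python
import pandas as pd


def add_city_mean_price(X_train, y_train, X_test, col="city"):
    """Add a `city_mean_price` feature computed from the training data only."""
    prices = pd.to_numeric(y_train)        # sold_price may be stored as object
    keys = X_train[col].astype(str)        # avoid categorical-dtype surprises
    overall_mean = prices.mean()
    city_means = prices.groupby(keys).mean()
    X_train = X_train.assign(city_mean_price=keys.map(city_means))
    X_test = X_test.assign(
        # cities never seen in training map to NaN, then to the overall mean
        city_mean_price=X_test[col].astype(str).map(city_means).fillna(overall_mean)
    )
    return X_train, X_test


# Made-up example: city "C" appears only in the test set
X_tr = pd.DataFrame({"city": ["A", "A", "B"]})
y_tr = pd.Series([100000, 120000, 300000])
X_te = pd.DataFrame({"city": ["A", "C"]})

X_tr_enc, X_te_enc = add_city_mean_price(X_tr, y_tr, X_te)
print(X_te_enc)
```

The original `city` column can then be dropped before modelling, since the new numeric feature carries the city information.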

## Extra Data - STRETCH

> This doesn't need to be part of your Minimum Viable Product (MVP). We recommend you write a functional, basic pipeline first, then circle back and join new data if you have time

> If you do this, try to write your downstream steps in a way it will still work on a dataframe with different features!

- You're not limited to just using the data provided to you. Think/ do some research about other features that might be useful to predict housing prices. 
- Can you import and join this data? Make sure you do any necessary preprocessing and make sure it is joined correctly.
- Example suggestion: could mortgage interest rates in the year of the listing affect the price? 

In [14]:
# import, join and preprocess new data here

## EDA/ Visualization

Remember all of the EDA that you've been learning about?  Now is a perfect time for it!
- Look at distributions of numerical variables to see the shape of the data and detect outliers.    
    - Consider transforming very skewed variables
- Scatterplots of a numerical variable and the target go a long way to show correlations.
- A heatmap will help detect highly correlated features, and we don't want these.
    - You may have too many features to do this, in which case you can simply compute the most correlated feature-pairs and list them
- Is there any overlap in any of the features? (redundant information, like number of this or that room...)
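
When there are too many features for a readable heatmap, the most correlated pairs can be listed instead. The sketch below is one possible way to do that; `top_correlated_pairs` is a hypothetical helper and the `toy` columns are made up.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, n=5):
    """Return the `n` feature pairs with the largest absolute correlation."""
    corr = frame.select_dtypes("number").corr().abs()
    # keep only the upper triangle so each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Made-up data: b is exactly 2 * a, so (a, b) should rank first
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

pairs = top_correlated_pairs(toy)
print(pairs)
```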

In [15]:
# perform EDA here

## Scaling and Finishing Up

Now is a great time to scale the data and save it once it's preprocessed.
- You can save it in your data folder, but you may want to make a new `processed/` subfolder to keep it organized

In [None]:
# scale the numeric columns so they suit our future models better
scaler = StandardScaler()
scaling_columns = int_columns + float_columns
X_train[scaling_columns] = scaler.fit_transform(X_train[scaling_columns])
X_test[scaling_columns] = scaler.transform(X_test[scaling_columns])
# save the data: the full dataframe as both a csv and a pickle so that we have it backed up, plus the X/y train/test data separately
df.to_csv('../data/processed/data_complete.csv')
df.to_pickle('../data/processed/data_complete.pkl')
X_test.to_pickle('../data/processed/X_test.pkl')
X_train.to_pickle('../data/processed/X_train.pkl')
y_test.to_pickle('../data/processed/y_test.pkl')
y_train.to_pickle('../data/processed/y_train.pkl')

['year_built', 'sqft', 'lot_sqft', 'baths', 'garage', 'stories', 'beds', 'postal_code', 'lon', 'lat']
