# Case Study: Regression

## Scenario

-------------------------------------------------------------------------------------------------------
You are working as an analyst for a real estate company. Your company wants to build a machine learning model to predict the selling prices of houses based on a variety of features on which the value of the house is evaluated.

------------------------------------------------------------------------------------------------------------

## Objective

----------------------------------------------------------------------------------------------------
Your job is to build a model that predicts the price of a house based on the features provided in the dataset. Senior management also wants to explore the characteristics of the houses using business intelligence tools. One of their questions is which factors are responsible for higher property values ($650K and above).

---------------------------------------------------------------------------------------------

## Dataset

--------------------------------------------------------------------------------------------
The dataset contains information on approximately 22,000 properties. It consists of historic data on houses sold between May 2014 and May 2015.

These are the definitions of data points provided:

Note: Some of the variables are self-explanatory, so no definition has been provided for them.

- id: unique identification number for the property
- date: the date the house was sold
- price: the price of the house
- waterfront: whether the house has a view of a waterfront
- condition: overall condition of the house, from 1 (worn out) to 5 (excellent)
- grade: overall grade given to the housing unit, based on the King County grading system, from 1 (poor) to 13 (excellent)
- sqft_above: square footage of the house, excluding the basement
- sqft_living15: living room area in 2015 (implies some renovations); this might or might not have affected the lot size
- sqft_lot15: lot size in 2015 (implies some renovations)
    
---------------------------------------------------------------------------------------------

### 1. Data Cleaning

In [1]:
# Importing necessary libraries
import pandas as pd

In [2]:
# Loading the data into a df
df = pd.read_csv("C:\\Users\\mafal\\Documents\\ironhack\\projects\\data_mid_bootcamp_project_regression\\regression_data.csv")

#### 1.1 Describe method

In [3]:
df.head()

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
0,7129300520,10/13/14,3,1.0,1180,5650,1.0,0,0,3,...,1180,0,1955,0,98178,47.5112,-122.257,1340,5650,221900
1,6414100192,12/9/14,3,2.25,2570,7242,2.0,0,0,3,...,2170,400,1951,1991,98125,47.721,-122.319,1690,7639,538000
2,5631500400,2/25/15,2,1.0,770,10000,1.0,0,0,3,...,770,0,1933,0,98028,47.7379,-122.233,2720,8062,180000
3,2487200875,12/9/14,4,3.0,1960,5000,1.0,0,0,5,...,1050,910,1965,0,98136,47.5208,-122.393,1360,5000,604000
4,1954400510,2/18/15,3,2.0,1680,8080,1.0,0,0,3,...,1680,0,1987,0,98074,47.6168,-122.045,1800,7503,510000


In [4]:
df.shape

(21597, 21)

In [5]:
# describe is a method and needs parentheses; a bare df.describe only displays the bound method
df.describe()

In [6]:
df.columns

Index(['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'price'],
      dtype='object')

#### 1.2 Checking null values

In [7]:
df.isna().sum()

id               0
date             0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
price            0
dtype: int64

#### 1.3 Checking Column Types

In [8]:
# Display the data types of the DataFrame
print(df.dtypes)

id                 int64
date              object
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
price              int64
dtype: object


In [9]:
# Print the column name and unique values for each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column: {column}")
    print(f"Unique values: {unique_values}\n")

Column: id
Unique values: [7129300520 6414100192 5631500400 ... 1523300141  291310100 1523300157]

Column: date
Unique values: ['10/13/14' '12/9/14' '2/25/15' '2/18/15' '5/12/14' '6/27/14' '1/15/15'
 '4/15/15' '3/12/15' '4/3/15' '5/27/14' '5/28/14' '10/7/14' '1/24/15'
 '7/31/14' '5/29/14' '12/5/14' '4/24/15' '5/14/14' '8/26/14' '7/3/14'
 '5/16/14' '11/20/14' '11/3/14' '6/26/14' '12/1/14' '6/24/14' '3/2/15'
 '11/10/14' '12/3/14' '6/13/14' '12/30/14' '2/13/15' '6/20/14' '7/15/14'
 '8/11/14' '7/7/14' '10/28/14' '7/29/14' '7/18/14' '3/25/15' '7/16/14'
 '4/28/15' '3/11/15' '9/16/14' '2/17/15' '12/31/14' '2/5/15' '3/3/15'
 '8/19/14' '4/7/15' '8/27/14' '2/23/15' '12/10/14' '8/28/14' '10/21/14'
 '12/7/14' '6/3/14' '9/9/14' '10/9/14' '8/25/14' '6/12/14' '9/12/14'
 '1/5/15' '6/10/14' '7/10/14' '3/16/15' '11/5/14' '4/20/15' '6/9/14'
 '3/23/15' '12/2/14' '12/22/14' '1/28/15' '6/2/14' '11/14/14' '6/18/14'
 '5/19/14' '9/4/14' '5/22/14' '2/26/15' '7/25/14' '12/23/14' '9/8/14'
 '3/30/15' '7/11/14' '6/
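
The loop above prints every distinct value, which gets long for columns such as `date`. As a more compact complement, the sketch below counts distinct values per column with `DataFrame.nunique()`. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "waterfront": [0, 0, 1, 0],
    "condition": [3, 5, 3, 4],
    "sqft_living": [1180, 2570, 770, 1960],
})

# Number of distinct values per column, smallest first;
# low counts hint at columns that behave like categories
unique_counts = toy.nunique().sort_values()
print(unique_counts)
```

Columns with only a handful of distinct values are natural candidates for a non-numeric type.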

-------------------------------------------------------------------------------------------------
The following columns act as classification features rather than quantities, so it makes sense to stop treating them as numbers (the cell below casts them to `object`; pandas' dedicated `category` dtype would be an alternative):
- waterfront
- condition
- grade
- zipcode

Since id is the unique key, it should not be treated as a numerical column either, so we convert it as well.

------------------------------------------------------------------------------------------------
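
For comparison, here is a minimal sketch of how the generic `object` dtype differs from pandas' `category` dtype. The `toy` frame is made up, and the ordered `grade` scale of 1 to 13 is an assumption based on the definition given in the Dataset section.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "grade": [7, 8, 11, 7],
    "zipcode": [98178, 98125, 98178, 98006],
})

# object: values are stored as generic Python objects
as_object = toy["zipcode"].astype("object")

# category: values are stored as codes pointing at a set of categories
as_category = toy["zipcode"].astype("category")
print(as_object.dtype, as_category.dtype)

# An ordered categorical keeps the ranking of grades (assumed scale: 1 to 13)
grade_dtype = pd.CategoricalDtype(categories=list(range(1, 14)), ordered=True)
grade_cat = toy["grade"].astype(grade_dtype)
print(grade_cat.max())
```

An ordered categorical still supports comparisons such as `max()`, which a plain unordered category does not.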

In [10]:
# Columns that act as labels rather than quantities
columns_to_convert = ['waterfront', 'condition', 'grade', 'id', 'zipcode']

# Cast them to the generic object dtype so they are no longer treated as numbers
for col in columns_to_convert:
    df[col] = df[col].astype('object')
    
# Display the data types of the DataFrame
print(df.dtypes)

id                object
date              object
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront        object
view               int64
condition         object
grade             object
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode           object
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
price              int64
dtype: object


#### 1.4 Cleaning Duplicates

In [11]:
# Display rows whose id already appeared earlier in the DataFrame
df[df['id'].duplicated()]

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
94,6021501535,12/23/14,3,1.50,1580,5000,1.0,0,0,3,...,1290,290,1939,0,98117,47.6870,-122.386,1570,4500,700000
314,4139480200,12/9/14,4,3.25,4290,12103,1.0,0,3,3,...,2690,1600,1997,0,98006,47.5503,-122.102,3860,11244,1400000
325,7520000520,3/11/15,2,1.00,1240,12092,1.0,0,0,3,...,960,280,1922,1984,98146,47.4957,-122.352,1820,7460,240500
346,3969300030,12/29/14,4,1.00,1000,7134,1.0,0,0,3,...,1000,0,1943,0,98178,47.4897,-122.240,1020,7138,239900
372,2231500030,3/24/15,4,2.25,2180,10754,1.0,0,0,5,...,1100,1080,1954,0,98133,47.7711,-122.341,1810,6929,530000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20165,7853400250,2/19/15,4,3.50,2910,5260,2.0,0,0,3,...,2910,0,2012,0,98065,47.5168,-121.883,2910,5260,645000
20597,2724049222,12/1/14,2,2.50,1000,1092,2.0,0,0,3,...,990,10,2004,0,98118,47.5419,-122.271,1330,1466,220000
20654,8564860270,3/30/15,4,2.50,2680,5539,2.0,0,0,3,...,2680,0,2013,0,98045,47.4759,-121.734,2680,5992,502000
20764,6300000226,5/4/15,4,1.00,1200,2171,1.5,0,0,3,...,1200,0,1933,0,98133,47.7076,-122.342,1130,1598,380000
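
By default, `duplicated()` flags only the second and later occurrences of an id, so the first sale of each repeated property is not shown. The sketch below, on made-up data, contrasts this with `keep=False`, which flags every row that shares an id.

```python
import pandas as pd

# Toy frame with made-up ids; 101 and 303 each appear twice
toy = pd.DataFrame({
    "id": [101, 101, 202, 303, 303],
    "date": ["7/25/14", "12/23/14", "2/25/15", "6/18/14", "12/9/14"],
    "price": [430000, 700000, 180000, 1380000, 1400000],
})

# Default keep='first': only the second and later occurrences are flagged
later_only = toy[toy["id"].duplicated()]

# keep=False: every row whose id appears more than once is flagged
all_repeats = toy[toy["id"].duplicated(keep=False)].sort_values("id")

print(len(later_only), len(all_repeats))
```

Viewing all repeated rows side by side makes it easier to see which columns actually differ between two sales of the same property.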


In [12]:
df[df['id']==6021501535]

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
93,6021501535,7/25/14,3,1.5,1580,5000,1.0,0,0,3,...,1290,290,1939,0,98117,47.687,-122.386,1570,4500,430000
94,6021501535,12/23/14,3,1.5,1580,5000,1.0,0,0,3,...,1290,290,1939,0,98117,47.687,-122.386,1570,4500,700000


In [13]:
# Show id, date and price (which differ between the rows) plus grade, which the truncated view above hides
df[df['id'] == 6021501535][['id', 'date', 'grade','price']]

Unnamed: 0,id,date,grade,price
93,6021501535,7/25/14,8,430000
94,6021501535,12/23/14,8,700000


In [14]:
df[df['id']==4139480200]

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
313,4139480200,6/18/14,4,3.25,4290,12103,1.0,0,3,3,...,2690,1600,1997,0,98006,47.5503,-122.102,3860,11244,1380000
314,4139480200,12/9/14,4,3.25,4290,12103,1.0,0,3,3,...,2690,1600,1997,0,98006,47.5503,-122.102,3860,11244,1400000


In [15]:
# Show id, date and price (which differ between the rows) plus grade, which the truncated view above hides
df[df['id'] == 4139480200][['id', 'date', 'grade','price']]

Unnamed: 0,id,date,grade,price
313,4139480200,6/18/14,11,1380000
314,4139480200,12/9/14,11,1400000


-------------------------------------------------------------------------------------------------------
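The repeated IDs are properties that were sold more than once in the period: the rows differ only in date and price. To clean the duplicates, we'll keep only the row with the most recent date for each ID.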

------------------------------------------------------------------------------------------------------

In [16]:
# Function that cleans the duplicates, keeping the row with the most recent date for each ID
def keep_latest_date(df, id_col, date_col):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Parse dates such as '10/13/14'; an explicit format avoids the format-inference warning
    df[date_col] = pd.to_datetime(df[date_col], format='%m/%d/%y')

    # Sort by ID and date in descending order so the latest sale comes first for each ID
    df = df.sort_values(by=[id_col, date_col], ascending=False)

    # Drop duplicates, keeping the first (latest) occurrence
    return df.drop_duplicates(subset=id_col, keep='first')

In [17]:
# Applying the function defined above
cleaned_df = df.copy()
cleaned_df = keep_latest_date(cleaned_df, 'id', 'date')

# Check that no duplicated ids remain (the result should be empty)
cleaned_df[cleaned_df['id'].duplicated()]


Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
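
As a cross-check on the sort-and-drop approach, the same "latest sale per id" rule can be sketched with `groupby` and `idxmax`. This is an alternative formulation on made-up data, not the method used in this notebook.

```python
import pandas as pd

# Toy frame with made-up values; id 101 was sold twice
toy = pd.DataFrame({
    "id": [101, 101, 202],
    "date": ["7/25/14", "12/23/14", "2/25/15"],
    "price": [430000, 700000, 180000],
})
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%y")

# Row label of the most recent sale for each id
latest_idx = toy.groupby("id")["date"].idxmax()

# Select those rows; each id now appears exactly once
latest = toy.loc[latest_idx]
print(latest)
print(latest["id"].is_unique)
```

`Series.is_unique` gives a one-line confirmation that no repeated ids remain, which is easier to read than an empty table.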


--------------------------------------------------------------------------------------------------
Now that no duplicates remain, each 'id' is unique and the column could serve as our index. It is kept as a regular column here, as the data types below show.

---------------------------------------------------------------------------------------------------

In [18]:
# Display the data types of the DataFrame
print(cleaned_df.dtypes)

id                       object
date             datetime64[ns]
bedrooms                  int64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront               object
view                      int64
condition                object
grade                    object
sqft_above                int64
sqft_basement             int64
yr_built                  int64
yr_renovated              int64
zipcode                  object
lat                     float64
long                    float64
sqft_living15             int64
sqft_lot15                int64
price                     int64
dtype: object


#### 1.5 Extra cleaning

In [19]:
# Check the row that has 33 bedrooms
cleaned_df[cleaned_df["bedrooms"] == 33]

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,...,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,price
15856,2402100895,2014-06-25,33,1.75,1620,6000,1.0,0,0,5,...,1040,580,1947,0,98103,47.6878,-122.331,1330,4700,640000


In [20]:
# Drop the row with 33 bedrooms: that count is implausible for a 1,620 sqft house with 1.75 bathrooms
final_df = cleaned_df[cleaned_df['bedrooms'] != 33]
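
One way to spot rows like this without knowing the exact value in advance is to compare living area with bedroom count. The sketch below uses made-up data, and the cutoff `MIN_SQFT_PER_BEDROOM` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy frame with made-up values; the third row mimics the 33-bedroom record
toy = pd.DataFrame({
    "bedrooms": [3, 4, 33, 2],
    "sqft_living": [1180, 1960, 1620, 770],
})

# Living area per bedroom; very small values suggest a data-entry error
toy["sqft_per_bedroom"] = toy["sqft_living"] / toy["bedrooms"]

# Hypothetical cutoff, chosen only for this illustration
MIN_SQFT_PER_BEDROOM = 100
suspicious = toy[toy["sqft_per_bedroom"] < MIN_SQFT_PER_BEDROOM]
print(suspicious)
```

Flagged rows still need a manual look before being dropped, since a low ratio alone does not prove the record is wrong.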

In [21]:
# Drop the yr_renovated column: 18033 rows have experienced a renovation, yet their yr_renovated value is 0
final_df = final_df.drop('yr_renovated', axis=1)

In [22]:
final_df

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,zipcode,lat,long,sqft_living15,sqft_lot15,price
15937,9900000190,2014-10-30,3,1.00,1320,8100,1.0,0,0,3,6,880,440,1943,98166,47.4697,-122.351,1000,8100,268950
20963,9895000040,2014-07-03,2,1.75,1410,1005,1.5,0,0,3,9,900,510,2011,98027,47.5446,-122.018,1440,1188,399900
7614,9842300540,2014-06-24,3,1.00,1100,4128,1.0,0,0,4,7,720,380,1942,98126,47.5296,-122.379,1510,4538,339000
3257,9842300485,2015-03-11,2,1.00,1040,7372,1.0,0,0,5,7,840,200,1939,98126,47.5285,-122.378,1930,5150,380000
16723,9842300095,2014-07-25,5,2.00,1600,4168,1.5,0,0,3,7,1600,0,1927,98126,47.5297,-122.381,1190,4168,365000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3553,3600057,2015-03-19,4,2.00,1650,3504,1.0,0,0,3,7,760,890,1951,98144,47.5803,-122.294,1480,3504,402500
8800,2800031,2015-04-01,3,1.00,1430,7599,1.5,0,0,4,6,1010,420,1930,98168,47.4783,-122.265,1290,10320,235000
8404,1200021,2014-08-11,3,1.00,1460,43000,1.0,0,0,3,7,1460,0,1952,98166,47.4434,-122.347,2250,20023,400000
6729,1200019,2014-05-08,4,1.75,2060,26036,1.0,0,0,4,8,1160,900,1947,98166,47.4444,-122.351,2590,21891,647500


In [23]:
final_df.shape

(21419, 20)

In [24]:
# describe is a method and needs parentheses; a bare final_df.describe only displays the bound method
final_df.describe()

In [25]:
final_df.isna().sum()

id               0
date             0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
price            0
dtype: int64

In [26]:
# Display the data types of the final DataFrame
print(final_df.dtypes)

id                       object
date             datetime64[ns]
bedrooms                  int64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront               object
view                      int64
condition                object
grade                    object
sqft_above                int64
sqft_basement             int64
yr_built                  int64
zipcode                  object
lat                     float64
long                    float64
sqft_living15             int64
sqft_lot15                int64
price                     int64
dtype: object


In [29]:
# Export to a CSV file
final_df.to_csv('C:\\Users\\mafal\\Documents\\ironhack\\projects\\data_mid_bootcamp_project_regression\\regression_data_cleaned.csv', index=False)

-------------------------------------------------------------------------------------------
The file we just exported will now be used to answer the SQL questions.

-------------------------------------------------------------------------------------------
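
As a rough sketch of how the cleaned data could be queried with SQL, the example below loads a small made-up frame into an in-memory SQLite database. The SQL engine, the table name `house_price_data` and the query are assumptions for illustration; the source does not say which database is used for the SQL questions.

```python
import sqlite3

import pandas as pd

# Toy frame standing in for the cleaned data (values are made up)
toy = pd.DataFrame({
    "id": [101, 202, 303],
    "bedrooms": [3, 4, 2],
    "price": [430000, 700000, 180000],
})

# In-memory SQLite database; the table name is hypothetical
conn = sqlite3.connect(":memory:")
toy.to_sql("house_price_data", conn, index=False, if_exists="replace")

# Example query: properties priced at $650K and above
query = "SELECT id, price FROM house_price_data WHERE price >= 650000"
expensive = pd.read_sql_query(query, conn)
conn.close()

print(expensive)
```

With the real file, `pd.read_csv` on the exported CSV would take the place of the toy frame.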