<div style="text-align:center;">
  <img src="./data/image.png" alt="Image" />
</div>


<h1>Task 2 - Python: Problem statement</h1>

***House price prediction***

This is an actual assessment I took when applying for a Data Analyst position at a FTSE 100 company.

It consisted of two tasks, with a total time limit of 2 hours and 30 minutes:

1. SQL 
2. Python 

dataset location: ./data/realest.csv

**description**

Target variable: Price (float)

- Describe the relationship between price vs features. 
- Build a linear regression model. 
- Handle missing values with listwise deletion. 
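
One simple way to describe the relationship between price and the features is to look at each feature's correlation with Price. The sketch below is only an illustration: the numbers are made up (they are not taken from realest.csv) and only a few of the columns are included.

```python
import pandas as pd

# Made-up sample with a subset of the task's columns (NOT the real dataset)
df = pd.DataFrame({
    "Space":     [800, 1000, 1200, 1500, 1800],
    "Tax":       [700, 900, 1100, 1400, 1600],
    "Condition": [1, 0, 1, 1, 0],
    "Price":     [200, 240, 270, 380, 420],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first)
price_corr = (
    df.corr()["Price"]
      .drop("Price")
      .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(price_corr)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a starting point rather than a full description.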

The dataset contains the following features:

    Bedroom   (int)
    Space     (int)
    Room      (int)
    Lot       (float)
    Tax       (float)
    Bathroom  (int)
    Garage    (int)
    Condition (int) 0 good 1 bad?

Some of the columns may contain missing values, so you must handle them properly.

Handle missing values with listwise_deletion(): exclude the entire record from the analysis if any single value
is missing (hint: the dropna() method).

Note: the actual test included the instruction below. It did not make sense to me, as I would expect to delete the rows with missing values before calculating the statistics.
Remember: do not delete the rows before you have the 'statistics', 'data_frame' and 'number_of_observations' calculated.
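
To see why the order matters, here is a small sketch with made-up values. pandas skips missing values column by column when computing statistics, whereas listwise deletion removes the whole row, so the same statistic can differ depending on when the rows are dropped.

```python
import numpy as np
import pandas as pd

# Made-up data with two incomplete rows (NOT the real dataset)
df = pd.DataFrame({
    "Tax":   [1000.0, 1200.0, np.nan, 1600.0],
    "Lot":   [30.0, np.nan, 40.0, 50.0],
    "Price": [250.0, 290.0, 270.0, 420.0],
})

# Before deletion: Price has no missing values, so all 4 rows are used
mean_before = df["Price"].mean()

# Listwise deletion: drop every row that has at least one missing value
complete = df.dropna()

# After deletion: only the 2 complete rows contribute
mean_after = complete["Price"].mean()

print(mean_before, mean_after, len(complete))
```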

Write a method named analyse_and_fit_lrm() which takes one argument (path) and returns a
dictionary of length 2 with the following objects:

**summary_dict** - dictionary of length 3
1. **statistics**: mean, standard deviation, median, min and max of the variable Tax for all houses with more than 1 bedroom and more than 1 bathroom, ordered by Price decreasing
2. **data_frame**: filtered by Space > 800
3. **number_observations**: a numeric value corresponding to the number of observations for which the value of the variable Lot is greater than or equal to the 4th 5-quantile of this variable (i.e. the 0.8 quantile)
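
The "4th 5-quantile" splits the data into five equal parts and takes the fourth cut point, which is the 0.8 quantile. A minimal sketch on made-up Lot values, using pandas' default linear interpolation:

```python
import pandas as pd

# Made-up Lot values (NOT the real dataset)
lot = pd.Series([30, 35, 40, 45, 50, 55, 60, 65, 70, 75], name="Lot")

# 4th 5-quantile == 0.8 quantile; here it falls between 65 and 70
threshold = lot.quantile(4 / 5)

# Count the observations at or above the threshold
count = int((lot >= threshold).sum())

print(threshold, count)
```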

**regression_dict** - dictionary of length 2
1. **model_parameters**: 9 parameters (the intercept plus one coefficient per feature), each named after its feature column. The first parameter should be named intercept.
2. **price_prediction**: the predicted price for an X_new with Bedroom = 3, Space = 1500, Room = 8, Lot = 40, Tax = 1000, Bathroom = 2, Garage = 1, Condition = 0
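
As an illustration of the expected shape of regression_dict, the sketch below fits sklearn's LinearRegression on made-up data with only two features and builds the named parameter dictionary with the intercept first. The data is generated from a known formula, so the fitted values can be checked by eye; none of it comes from the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data generated from Price = 50 + 0.1*Space + 20*Bathroom
train = pd.DataFrame({
    "Space":    [1000, 1200, 1500, 1800, 2000],
    "Bathroom": [1, 2, 2, 3, 2],
})
train["Price"] = 50 + 0.1 * train["Space"] + 20 * train["Bathroom"]

features = train.drop(columns="Price")
model = LinearRegression().fit(features, train["Price"])

# Intercept first, then one coefficient per feature column, in column order
model_parameters = {
    "Intercept": model.intercept_,
    **dict(zip(features.columns, model.coef_)),
}

# X_new as a DataFrame keeps the column names and order explicit
X_new = pd.DataFrame([{"Space": 1500, "Bathroom": 2}])
price_prediction = float(model.predict(X_new)[0])

print(model_parameters, price_prediction)
```

Because the data is noise-free, the prediction should come out very close to 50 + 150 + 40 = 240.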


There was an option to implement the linear regression from scratch. The test also linked to a Wikipedia page describing the math behind linear regression.

At the time I did not want to take the risk, so I went with sklearn.
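
For reference, the from-scratch route usually means the normal equation, w = (XᵀX)⁻¹Xᵀy, with a column of ones added to X for the intercept. The sketch below is my own illustration on synthetic data, not something taken from the test. It solves the system with np.linalg.solve instead of forming the inverse explicitly, and cross-checks against np.linalg.lstsq.

```python
import numpy as np

# Synthetic, noise-free data with known weights so the result can be checked
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_w

# Prepend a column of ones so the first weight acts as the intercept
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation, solved as a linear system rather than via an explicit inverse
w_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Least-squares solver; generally more robust when columns are nearly collinear
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both should recover an intercept of about 4 followed by the three weights.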

In [4]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

class AnalysisDataAndFitLinearRegression:

    def __init__(self):
        self.version = 1
        
    def etl(self, path): # Added this etl function: In the test it was included in the analyse_and_fit_lrm()
        # Load dataset
        data = pd.read_csv(path)
        
        # Perform listwise deletion to handle missing values
        data = self.__listwise_deletion(data)
        
        return data
    
    def linear_regression_from_scratch(self, X, y, X_new, feature_names): # (Optional)
        
        # Add the bias term to the feature matrix by prepending a column of ones
        X = np.hstack([np.ones((X.shape[0], 1)), X])

        # Compute the dot product of the transpose of X (with bias) and X (with bias)
        XTX = np.dot(X.T, X)

        # Compute the inverse of the resulting matrix
        XTX_inv = np.linalg.inv(XTX)

        # Compute the dot product of the transpose of X (with bias) and the target variable y
        XTy = np.dot(X.T, y)

        # Normal equation: weights = (X^T X)^-1 X^T y
        weights = np.dot(XTX_inv, XTy)

        # Make a prediction for X_new, including the bias term
        X_new              = np.hstack([np.ones((X_new.shape[0], 1)), X_new])
        price_prediction   = np.dot(X_new, weights)         # E(y)=XW

        model_parameters = {**{'Intercept': weights.ravel()[0]}, **dict(zip(feature_names, weights.ravel()[1:]))}
        regression_dict  = {
                            'model_parameters': model_parameters,
                            'price_prediction': price_prediction[0]
                    }
        return regression_dict
        
        
    def analyse_and_fit_lrm(self, path, option=None):
        
        # Extract transform and load data
        data = self.etl(path)

        # Calculate summary statistics for the variable Tax for all houses with more than 1 bathroom and more than 1 bedroom
        mask = (data['Bathroom'] > 1) & (data['Bedroom'] > 1)

        statistics = {
            "mean":    data.loc[mask, "Tax"].mean(),
            "std_dev": data.loc[mask, "Tax"].std(),
            "median":  data.loc[mask, "Tax"].median(),
            "min":     data.loc[mask, "Tax"].min(),
            "max":     data.loc[mask, "Tax"].max()
                     }

        # Store the filtered dataframe, ordered by Price decreasing
        mask                    = data.Space > 800
        data_frame              = data.loc[mask, :].sort_values(by='Price', ascending=False)

        # Calculate the 4th 5-quantile (0.8 quantile) of the Lot variable
        quantile_4th_out_of_5   = data.Lot.quantile(4/5)

        number_of_observations  = sum(data.Lot >= quantile_4th_out_of_5)

        # storing the results
        summary_dict = {'statistics': statistics, 'data_frame': data_frame, 'number_of_observations': number_of_observations}

        # Remove NA's after summary-statistics (a no-op here, because etl() already dropped incomplete rows)
        data = self.__listwise_deletion(data) # overwrite

        ## Model train
        feature_names = data.drop('Price', axis=1).columns
        y = np.array(data['Price'].values  ).reshape(-1,1)           # target 
        X = np.array(data.drop('Price', axis=1).values) # features set

        # New observation to predict, in the same column order as the features
        # numpy 
        X_new = np.array([[3, 1500, 8, 40, 1000, 2, 1, 0]])
        # or pandas
        # X_new = pd.DataFrame([{'Bedroom': 3, 'Space': 1500, 'Room': 8, 'Lot': 40, 'Tax': 1000, 'Bathroom': 2, 'Garage': 1, 'Condition': 0 }])

        if option is None:
            lm = LinearRegression() 
            lm.fit(X, y)
            # predict price based on X_new
            price_prediction = lm.predict(X_new)

            # Store lm results
            model_parameters = {**{'Intercept': lm.intercept_[0]}, **dict(zip(feature_names, lm.coef_.ravel()))}
            regression_dict  = {
                    'model_parameters': model_parameters,
                    'price_prediction': price_prediction[0]
                                }
        else:
            # Any non-None option selects the from-scratch implementation
            regression_dict = self.linear_regression_from_scratch(X, y, X_new, feature_names)
            
        # return the summary and ml results
        return {'summary_dict': summary_dict, 'regression_dict': regression_dict}

    def __listwise_deletion(self, data: pd.DataFrame):
        return data.dropna()

In [5]:
analysis = AnalysisDataAndFitLinearRegression()
path     = "./data/realest.csv"
results  = analysis.analyse_and_fit_lrm(path)
# Print the results
print(results)

{'summary_dict': {'statistics': {'mean': 1325.0, 'std_dev': 221.7355782608345, 'median': 1300.0, 'min': 1100, 'max': 1600}, 'data_frame':    Bedroom  Space  Room  Lot   Tax  Bathroom  Garage  Condition  Price
3        5   1800    10   45  1600         3       2          0    420
4        4   1500     9   50  1400         3       2          1    380
1        4   1300     8   35  1200         2       1          1    290
2        3   1200     7   40  1100         2       1          1    270, 'number_of_observations': 1}, 'regression_dict': {'model_parameters': {'Intercept': -143.7080506789178, 'Bedroom': 2.5209859270886867, 'Space': 0.0007597089319073547, 'Room': -0.9735361536635794, 'Lot': 2.084394366675951, 'Tax': 0.287985511667638, 'Bathroom': 0.9792882668076303, 'Garage': 0.9792882668076303, 'Condition': 8.950492953145217}, 'price_prediction': array([231.50533241])}}


Solving this requires a good foundational knowledge of Python's data structures, as well as classes and functions.
You also need basic knowledge of pandas, numpy and sklearn, and of how to combine their data types.

# Tips & reflections
- Pay attention to detail. Even a small slip, such as sorting in the wrong direction, makes your answer wrong, as it did mine.
- Make sure you get familiar with the Codility platform and this type of analysis.
- Make sure you understand Python classes and can implement methods inside a class. This is very different from the classic Jupyter notebook style.
- During the assessment I had trouble locating a test dataset and debugging this program.
- The mistake I made was running all the code together late in the session instead of bit by bit, as you would in Jupyter.
- This made the process more complicated and more difficult to debug once I hit errors.
- Even if the main method is correct, reshaping data, typos, incorrect function calls or code in the wrong order can lead to errors.
- Go step by step; do not leave running everything until the end.
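
If no test dataset is at hand, one option is to generate a small synthetic one and check each step on it before moving on. The sketch below is only a suggestion: make_test_csv is a hypothetical helper, the file name is arbitrary, and the value ranges and Price formula are invented. Only the column names follow the task description.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_test_csv(path, n_rows=30, seed=0):
    """Hypothetical helper: write a small synthetic CSV with the task's columns."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "Bedroom":   rng.integers(1, 6, n_rows),
        "Space":     rng.integers(500, 2500, n_rows),
        "Room":      rng.integers(3, 12, n_rows),
        "Lot":       rng.uniform(20, 60, n_rows).round(1),
        "Tax":       rng.uniform(500, 2000, n_rows).round(0),
        "Bathroom":  rng.integers(1, 4, n_rows),
        "Garage":    rng.integers(0, 3, n_rows),
        "Condition": rng.integers(0, 2, n_rows),
    })
    # Invented price formula, only so that the target column exists
    df["Price"] = (0.15 * df["Space"] + 0.1 * df["Tax"]).round(1)
    # Blank out a few cells so the missing-value handling is exercised
    df.loc[[2, 7], "Tax"] = np.nan
    df.loc[[11], "Lot"] = np.nan
    df.to_csv(path, index=False)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "realest_test.csv")
    make_test_csv(csv_path)

    # Step 1: load the file and check its shape before doing anything else
    data = pd.read_csv(csv_path)
    assert data.shape == (30, 9)

    # Step 2: count the rows that contain at least one missing value
    n_missing_rows = int(data.isna().any(axis=1).sum())

    # Step 3: apply listwise deletion and confirm the row count changed as expected
    clean = data.dropna()
    assert len(clean) == 30 - n_missing_rows

print(n_missing_rows, len(clean))
```

Checking one step at a time like this makes it much easier to see which line introduced a problem.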