# Analyze the data and apply a linear regression model

You are given a dataset, realest.csv, which contains information about house prices in the suburbs of Chicago. Your task is first to analyze the data, and then to apply a regression model to it.

# Data overview

You can access the dataset using the path ./data/realest.csv. The dataset consists of the following variables:

- Price: price of the house
- Bedroom: the number of bedrooms
- Space: size of the house (in square feet)
- Room: the number of rooms
- Lot: width of a lot
- Tax: amount of annual tax
- Bathroom: the number of bathrooms
- Garage: the number of parking spaces in the garage
- Condition: condition of the house (1 if good, 0 otherwise)

The values in some of the columns may be missing, so you must handle this properly.

We want to describe the relationship between Price (the dependent variable in the model) and all other variables (the predictors) using a linear regression model.

When building the linear regression model, you should handle missing data using listwise deletion: exclude an entire record from the analysis if any single value is missing (hint: the dropna() method from pandas).
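
As a minimal sketch of listwise deletion, the example below drops incomplete rows from a small made-up data frame. The values are hypothetical and only illustrate the behaviour of dropna(); the real data comes from realest.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from realest.csv.
toy = pd.DataFrame({
    "Price": [50.0, 62.0, np.nan, 71.0],
    "Space": [900.0, np.nan, 1100.0, 1300.0],
    "Tax": [650.0, 800.0, 720.0, 910.0],
})

# Listwise deletion: drop every row that has at least one missing value.
complete = toy.dropna(axis=0, how="any")

print(len(toy), len(complete))
```

Only the rows without any missing value remain, and they keep their original index labels.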

To fit a model to the data, you can either use built-in functions or calculate the parameters of the model from scratch. If you choose the latter approach, here you will find all the equations you need to implement the least-squares method for calculating the model parameters.
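
The equations referred to above are not reproduced in this section. As a rough sketch of the from-scratch route, assuming ordinary least squares with an intercept, the parameters can be obtained by solving the least-squares problem on a design matrix whose first column is all ones. The helper name fit_least_squares and the synthetic data are illustrative assumptions, not part of the task.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (intercept, coefficients) minimising the squared residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first parameter is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the least-squares problem without explicitly inverting X^T X.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

# Synthetic check: y = 2 + 3*x1 - 1*x2 holds exactly for every row.
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0]])
y_demo = 2 + 3 * X_demo[:, 0] - X_demo[:, 1]
intercept, coefs = fit_least_squares(X_demo, y_demo)
print(intercept, coefs)
```

Because the synthetic target is an exact linear function of the inputs, the fit should recover the intercept and both coefficients up to floating-point error.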

# Task details

Write a class AnalysisDataAndFitLinearRegression with a method named analyse_and_fit_lrm() which takes one argument (a path to a dataset) and returns a dictionary of length 2 with the following objects (the order and names of the objects should be the same as below):

- summary_dict - a dictionary of length 3 with the following elements:

  (1) statistics - a list of numbers of length 5 with the mean, standard deviation, median, minimum and maximum of the variable Tax for all houses with two bathrooms and four bedrooms.

  (2) data_frame - a data frame with observations for which Space is bigger than 800, ordered by decreasing Price.

  (3) number_of_observations - a numeric value corresponding to the number of observations for which the value of the variable Lot is equal to or greater than the 80th percentile of this variable.


- regression_dict - a dictionary of length 2 with the following elements:

  (1) model_parameters - a dictionary of length 9 with the model parameters. The first key of the dictionary should be named Intercept, and all other keys should have the same name as the respective variable.

  (2) price_prediction - a numeric value which corresponds to the prediction of the price (using the fitted model) for a house with the following specific parameters: three bedrooms; 1500 square feet of space; eight rooms; width of lot is 40; $1000 tax; two bathrooms; one space in the garage; house is in bad condition.
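
The three summary_dict items described above can be sketched on a small made-up data frame. All values below are hypothetical. Note that pandas' std() defaults to the sample standard deviation (ddof=1), while np.std defaults to ddof=0; the task does not say which one is expected.

```python
import pandas as pd

# Hypothetical toy data, not taken from realest.csv.
toy = pd.DataFrame({
    "Price": [50, 62, 58, 71, 45],
    "Bedroom": [4, 4, 3, 4, 4],
    "Bathroom": [2, 2, 2, 1, 2],
    "Space": [900, 750, 1100, 1300, 820],
    "Lot": [30, 35, 40, 50, 25],
    "Tax": [600, 800, 720, 910, 1000],
})

# (1) Tax statistics for houses with two bathrooms and four bedrooms.
tax = toy.loc[(toy["Bathroom"] == 2) & (toy["Bedroom"] == 4), "Tax"]
statistics = [tax.mean(), tax.std(), tax.median(), tax.min(), tax.max()]

# (2) Rows with Space greater than 800, ordered by decreasing Price.
data_frame = toy[toy["Space"] > 800].sort_values(by="Price", ascending=False)

# (3) Number of rows whose Lot is at or above the 80th percentile of Lot.
threshold = toy["Lot"].quantile(0.8)
number_of_observations = int((toy["Lot"] >= threshold).sum())

print(statistics, list(data_frame.index), number_of_observations)
```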


Apart from base Python, you can use the numpy, pandas and scikit-learn packages. Keep in mind that dicts preserve their insertion order (guaranteed by the language since Python 3.7; in CPython 3.6 it was an implementation detail).
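
Since insertion order is preserved, the required key order can be obtained simply by inserting Intercept first. The sketch below uses made-up parameter values for three predictors (they are not fitted to realest.csv) and also shows that a prediction is the intercept plus the sum of each coefficient times its feature value.

```python
# Hypothetical parameter values, for illustration only.
feature_names = ["Bedroom", "Space", "Room"]
intercept = 10.0
coefficients = [-2.0, 0.01, 1.5]

# Insert Intercept first; the remaining keys follow in feature order.
model_parameters = {"Intercept": intercept}
model_parameters.update(zip(feature_names, coefficients))

# A prediction is the intercept plus the sum of coefficient * feature value.
house = {"Bedroom": 3, "Space": 1500, "Room": 8}
price_prediction = model_parameters["Intercept"] + sum(
    model_parameters[name] * value for name, value in house.items()
)

print(list(model_parameters), price_prediction)
```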
 





import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

class AnalysisDataAndFitLinearRegression:
    def __init__(self):
        self.version = 1

    def analyse_and_fit_lrm(self, path):
        data = pd.read_csv(path)
        # Listwise deletion: drop every row with at least one missing value.
        # It is applied up front, so the summaries below also use complete rows only.
        data = self.__listwise_deletion(data)
        # Tax values for houses with two bathrooms and four bedrooms
        tax = data.loc[(data['Bathroom'] == 2) & (data['Bedroom'] == 4), 'Tax']
        # Mean, standard deviation, median, minimum and maximum.
        # np.std defaults to ddof=0 (population standard deviation); pass ddof=1 for the sample one.
        statistics = [np.mean(tax), np.std(tax), np.median(tax), np.min(tax), np.max(tax)]
        # Rows with 'Space' greater than 800, ordered by decreasing 'Price'
        data_frame = data[data['Space'] > 800].sort_values(by='Price', ascending=False)
        # Number of rows whose 'Lot' is at or above the 80th percentile of 'Lot'
        number_of_observations = len(data[data['Lot'] >= np.percentile(data['Lot'], 80)])
        summary_dict = {'statistics': statistics, 'data_frame': data_frame,
                        'number_of_observations': number_of_observations}
        # Split predictors (independent variables) and target (dependent variable)
        x_features = data.drop(columns=['Price'])
        y_target = data['Price']
        # Fit a linear regression model with scikit-learn
        lr = LinearRegression()
        lr.fit(x_features, y_target)
        # 'Intercept' first, then one coefficient per predictor, in column order
        model_parameters = {'Intercept': lr.intercept_}
        model_parameters.update(zip(x_features.columns, lr.coef_))
        # House to predict, keyed by column name and reordered to match the training columns
        new_house = pd.DataFrame([{'Bedroom': 3, 'Space': 1500, 'Room': 8, 'Lot': 40,
                                   'Tax': 1000, 'Bathroom': 2, 'Garage': 1, 'Condition': 0}])
        price_prediction = lr.predict(new_house[x_features.columns])[0]
        regression_dict = {'model_parameters': model_parameters,
                           'price_prediction': price_prediction}
        # Return a dictionary of length 2, with the required names and order
        return {'summary_dict': summary_dict, 'regression_dict': regression_dict}

    def __listwise_deletion(self, data: pd.DataFrame):
        return data.dropna(axis=0)


if __name__ == "__main__":
    result = AnalysisDataAndFitLinearRegression().analyse_and_fit_lrm('./data/realest.csv')
    print(result['regression_dict'])

