# FIT5196 Assessment 2: Cleansing and Integrating Raw Data

#### Student Name: Chandni Gupta
#### Student ID: 29079896

Date: Sunday, 13 May 2018 

Environment: Python 3.6.4 and Anaconda3-5.1.0 (64-bit)

Libraries used:
* pandas 0.22.0 (for data frame, included in Anaconda Python 3.6) 
* LinearRegression from scikit-learn (for the fit() and predict() functions)
* math (for the modf() function)


# Introduction
This assignment consists of the following tasks:

1. Auditing and cleansing the Job dataset,
2. Integrating the Job datasets,
3. Finding missing values and filling in reasonable values,
4. Finding the outliers.

# Task 3. Finding missing values and filling in reasonable values

### 1. Import Libraries

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import math

### 2. Read a file

In [None]:
data3 = pd.read_csv("dataset3_with_missing.csv")
data3.head()

### 3. Display its information

In [None]:
data3.info()

### 4. Describe the data

In [None]:
data3.describe()

In [None]:
data3.describe(include=['O']) # 'O' for Objects

### 5. Check missing values

In [None]:
data3.isnull().sum()

###### Observations:
- bathrooms column contains 400 missing values
- sqft_living column contains 66 missing values
- sqft_above column contains 67 missing values
- sqft_basement column contains 67 missing values


### 6. Impute missing values of 'sqft_living'

###### Checking whether sqft_above or sqft_basement is also null in the rows where sqft_living is null

In [None]:
living_null = data3[data3['sqft_living'].isna()]
# Empty output means the column is available for every row with a null sqft_living
print(living_null[living_null['sqft_above'].isna()])
print(living_null[living_null['sqft_basement'].isna()])

###### Adding sqft_above and sqft_basement and filling null values of sqft_living with the result

In [None]:
data3['sqft_living'] = data3['sqft_living'].fillna(data3['sqft_above'] + data3['sqft_basement'])
data3['sqft_living'].describe()

###### Observations:
- After imputing the missing values of 'sqft_living', its count is 9967, which equals the total number of rows in the data.
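
The fill above relies on the relationship sqft_living = sqft_above + sqft_basement, which this notebook takes as given. A minimal sketch of how `fillna` behaves here, using a hypothetical toy dataframe (the values are made up for illustration and are not taken from dataset3):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row is missing sqft_living
toy = pd.DataFrame({
    'sqft_living':   [1500.0, np.nan, 2000.0],
    'sqft_above':    [1000.0, 1200.0, 2000.0],
    'sqft_basement': [500.0, 300.0, 0.0],
})

# fillna aligns on the index, so only the missing entry is replaced
toy['sqft_living'] = toy['sqft_living'].fillna(toy['sqft_above'] + toy['sqft_basement'])
filled = toy['sqft_living'].tolist()
```

Rows that already have a value are left untouched, so the fill cannot overwrite observed data.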

### 7. Impute missing values of 'sqft_above'

###### Checking whether sqft_living or sqft_basement is also null in the rows where sqft_above is null

In [None]:
above_null = data3[data3['sqft_above'].isna()]
# Empty output means the column is available for every row with a null sqft_above
print(above_null[above_null['sqft_living'].isna()])
print(above_null[above_null['sqft_basement'].isna()])

###### Subtracting sqft_basement from sqft_living and filling null values of sqft_above with the result

In [None]:
data3['sqft_above'] = data3['sqft_above'].fillna(data3['sqft_living'] - data3['sqft_basement'])
data3['sqft_above'].describe()

###### Observations:
- After imputing the missing values of 'sqft_above', its count is 9967, which equals the total number of rows in the data.

### 8. Impute missing values of 'sqft_basement'

###### Checking whether sqft_living or sqft_above is also null in the rows where sqft_basement is null

In [None]:
basement_null = data3[data3['sqft_basement'].isna()]
# Empty output means the column is available for every row with a null sqft_basement
print(basement_null[basement_null['sqft_living'].isna()])
print(basement_null[basement_null['sqft_above'].isna()])

###### Subtracting sqft_above from sqft_living and filling null values of sqft_basement with the result

In [None]:
data3['sqft_basement'] = data3['sqft_basement'].fillna(data3['sqft_living'] - data3['sqft_above'])
data3['sqft_basement'].describe()

###### Observations:
- After imputing the missing values of 'sqft_basement', its count is 9967, which equals the total number of rows in the data.
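
Once all three area columns are filled, it may be worth confirming that they still agree with each other. The sketch below is one possible check, not part of the original workflow; the helper name `area_mismatches` and the toy values are hypothetical.

```python
import pandas as pd

def area_mismatches(df):
    """Return the rows where sqft_living differs from sqft_above + sqft_basement.

    Rows with a NaN in any of the three columns are also returned,
    because a comparison against NaN is never equal.
    """
    expected = df['sqft_above'] + df['sqft_basement']
    return df[df['sqft_living'] != expected]

# Hypothetical toy data: the second row is inconsistent (1200 + 300 != 1400)
toy = pd.DataFrame({
    'sqft_living':   [1500.0, 1400.0, 2000.0],
    'sqft_above':    [1000.0, 1200.0, 2000.0],
    'sqft_basement': [500.0, 300.0, 0.0],
})
bad = area_mismatches(toy)
```

Calling `area_mismatches(data3)` would list any rows of the real data that violate the relationship, whether the inconsistency comes from the raw file or from an imputation step.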

### 9. Impute missing values of 'bathrooms'

###### Displaying rows in which bathrooms is null

In [None]:
data3[data3['bathrooms'].isna()].head()

###### Creating a LinearRegression model instance

In [None]:
lg = LinearRegression()

###### Creating a new dataframe using price, bedrooms, sqft_living, floors and bathrooms columns of data3 dataframe

In [None]:
data_null = data3[['price','bedrooms','sqft_living','floors','bathrooms']]
data_no_null = data_null.dropna() # Keep only the rows without null values for training

###### Initialising a variable using all columns except the 'bathrooms' column, whose values are to be predicted

In [None]:
train_x = data_no_null.iloc[:,:4]
train_x.head()

###### Initialising a variable with 'bathrooms' column whose values are to be predicted

In [None]:
train_y = data_no_null.iloc[:,4]
train_y.head()

###### Fitting the model on the training data

In [None]:
lg.fit(train_x,train_y)

In [None]:
# Predictor columns for every row, including the rows where bathrooms is null
test_data = data_null.iloc[:,:4]

###### Predicting the values of bathrooms

In [None]:
bath_pred = pd.DataFrame()
bath_pred['bathrooms'] = lg.predict(test_data)

###### Filling the null values with the predicted ones

In [None]:
# fillna aligns on the index, so only the null entries take the predicted value
data3['bathrooms'] = data3['bathrooms'].fillna(bath_pred['bathrooms'])
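
The cells above predict a value for every row and rely on index alignment in `fillna` to pick out the missing ones. A variant that predicts only for the rows where the target is null is sketched below; it is an alternative shown for illustration, not the approach used in this notebook. The toy data is hypothetical and constructed so that bathrooms is exactly 0.5 * bedrooms + 0.5 * floors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the last two rows are missing bathrooms
toy = pd.DataFrame({
    'bedrooms':  [2.0, 3.0, 4.0, 5.0, 3.0, 4.0],
    'floors':    [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    'bathrooms': [1.5, 2.5, 2.5, 3.5, np.nan, np.nan],
})

features = ['bedrooms', 'floors']
missing = toy['bathrooms'].isna()

# Train on the complete rows only
model = LinearRegression()
model.fit(toy.loc[~missing, features], toy.loc[~missing, 'bathrooms'])

# Predict and assign only where bathrooms is null
toy.loc[missing, 'bathrooms'] = model.predict(toy.loc[missing, features])
imputed = toy['bathrooms'].tolist()
```

Because the assignment goes through a boolean mask, this form does not depend on the dataframe having the default 0..n-1 index.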

###### Adjusting the predicted values so that they follow the specification for bathrooms (fractional part of .0, .25, .5 or .75)

In [None]:
# Loop over every row of data3
for index, row in data3.iterrows():
    # Split the value into its fractional and whole parts
    frac, whole = math.modf(row.bathrooms)
    frac = round(frac, 2)
    # Fractional parts other than .0, .25, .5 and .75 are rounded to the nearest whole number
    if frac not in (0.0, 0.25, 0.5, 0.75):
        # .at replaces the deprecated set_value()
        data3.at[index, 'bathrooms'] = whole + round(frac)
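
The rule applied in the loop above can be isolated in a small helper, which makes it easier to check on individual values. This is a sketch under the same rule; the function name `snap_bathrooms` is hypothetical.

```python
import math

ALLOWED_FRACTIONS = (0.0, 0.25, 0.5, 0.75)

def snap_bathrooms(value):
    """Keep values whose fractional part (to 2 decimals) is a quarter step;
    round every other value to the nearest whole number."""
    frac, whole = math.modf(value)
    frac = round(frac, 2)
    if frac in ALLOWED_FRACTIONS:
        return value
    return whole + round(frac)
```

For example, `snap_bathrooms(2.25)` leaves the value unchanged, while `snap_bathrooms(2.37)` gives 2.0 and `snap_bathrooms(1.8)` gives 2.0. With the helper in place, the loop could be replaced by `data3['bathrooms'].apply(snap_bathrooms)`.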

###### Checking whether any column still contains null values after imputation

In [None]:
data3.isnull().sum()

###### Observations:
- No column contains missing values.

In [None]:
data3.head()

### 10. Write the dataframe to a CSV file

In [None]:
data3.to_csv('./dataset3_solution.csv',encoding='utf-8')

## References
- Tutorial 5
- Tutorial 6A
- stackoverflow.com