# Data Cleaning Practices

In this lab, we will apply concepts from the Day 1-2 lectures to clean a given Airbnb dataset.

This dataset (raw.csv) contains 30k+ records on hotels in the top-10 tourist destinations and major US metropolitan areas, scraped from Airbnb.com.
Each data record has 40 attributes including the number of bedrooms, price, location, etc. 
The attribute "pop2016" means population of the zipcode location (area) in year 2016.
Demographic and economic attributes were scraped from city-data.com. 

Updated: short descriptions of the attributes:

## House specific features, collected from Airbnb.com:
Bathrooms: The number of bathrooms in the listing
Bedrooms: The number of bedrooms
Beds: The number of beds
LocationName: Location of the house
NumGuests: Maximum number of guests the house can hold
NumReviews: Number of reviews received
Price: Daily price in local currency
Rating: Y/N - whether the rating of the house is 5 or not
latitude: location information latitude
longitude: location information longitude
zipcode: zipcode of the house

## Demographic and economic attributes based on zipcode, collected from city-data.com (listings in the same zipcode should share the same value for each of the following attributes)
pop2016: population of the area reported in 2016
pop2010: population of the area reported in 2010
pop2000: population of the area reported in 2000
cost_living_index: a standardized U.S. index measuring the cost of living
land_area: land area of the zipcode
water_area: space of water area
pop_density: density of population 
number of males: within the area population
number of females: within the area population
prop taxes paid 2016: Median real estate property taxes paid for housing units in 2016
median taxes: median of taxes paid by house owners in the area
median house value: median of house value in the area
median household income: median of income of house owners in the area
median monthly onwer costs (with mortgage): median monthly cost of house owner including mortgage
median monthly onwer costs (no mortgage): median monthly cost of house owner without considering mortgage
median gross rent: the monthly rent agreed or contracted for plus the estimated monthly cost of utilities and fuels.
median asking price for vacant for-sale home/condo: median asking price for for-sale home in the area
unemployment: unemployment rate of the area

## Aggregated features for Abnb by zipcode
Number of Homes	Count of Abnb:	number of Abnb houses in this area
Density of Abnb (%): ratio of Abnb houses in this area
Average Abnb Price (by zipcode): aggregated by zipcode
Average NumReviews (by zipcode): aggregated by zipcode	
Average Rating (by zipcode): aggregated by zipcode
Average Number of Bathrooms (by zipcode): aggregated by zipcode
Average Number of Bedrooms (by zipcode): aggregated by zipcode
Average Number of Beds (by zipcode): aggregated by zipcode
Average Number of Guests (by zipcode): aggregated by zipcode



The prediction label is the Rating of the house.

## Submission: submit via onq. 


In [None]:
# Step 1: Import needed libraries. E.g., pandas, missingno, and sklearn

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import RandomOverSampler
import missingno
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None) 

Task 1: Read the dataset and perform basic data exploration. Specifically, you should write code to explore the types of data provided.

In [None]:
df = pd.read_csv('raw.csv')
print(f"There are {df.shape[0]} rows and {df.shape[1]} features")
df.sample(20)

In [None]:
# TODO for Task 1, put your code here to perform data type and data scale check
print(df.info())
df.describe()
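
As a complement to `df.info()` and `df.describe()`, the attributes can be split by type so that numeric and categorical columns are inspected separately. The following is a minimal sketch on a toy frame, not on raw.csv; the frame `toy` and the names `numeric_cols`, `categorical_cols`, and `summary` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the lab data (assumed values; column names follow the attribute list above).
toy = pd.DataFrame({
    "LocationName": ["Boston", "Miami"],
    "Rating": ["Y", "N"],
    "Bedrooms": [1.0, 2.0],
    "NumReviews": [12, 3],
})

# Separate numeric attributes from everything else (strings, categories).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One row per attribute: its storage type and its number of distinct values.
summary = pd.DataFrame({"dtype": toy.dtypes.astype(str), "n_unique": toy.nunique()})

print(numeric_cols)
print(categorical_cols)
print(summary)
```

On the real data, a numeric column with very few distinct values may be better treated as categorical, which the `n_unique` column helps to spot.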

In [None]:
df['Rating'].value_counts().plot(kind='bar')

In [None]:
# Plot the distribution (KDE) of each numeric feature that has no null values.
for column in df.columns:
    if df[column].dtype != object and not df[column].isna().any():
        # displot is a figure-level function and creates its own figure,
        # so no separate plt.figure() call is needed.
        sns.displot(df[column], kind="kde")


Task 2: Data quality check: do duplicate entries exist in this table? Do they have consistent values? Briefly explain your methodology and your findings within this markdown cell, and write corresponding code in the next code cell.

1- Do duplicate entries exist in this table? Yes, there are duplicated entries in this dataframe.<br>
2- Do they have consistent values? Yes, the duplicated entries are consistent: they match across all features.

Methodology:
1. Check whether there are any duplicated rows in this table.
2. Count the number of duplicated rows in this table.
3. Drop the duplicated rows and keep the first occurrences.

Findings: 12.83% of the rows are consistent duplicates (entries duplicated across all features).

In [None]:
# TODO for Task 2
if df.duplicated().any():
    print("There are duplicated rows in this dataset")
else:
    print("There are no duplicated rows in this dataset")


print(f"There are {df.duplicated().sum()} consistent duplicated rows in this dataset.")
df[df.duplicated()].head(20)

In [None]:
#drop duplicated rows and keep the first occurrence
df = df.drop_duplicates(keep='first')
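
Exact duplicates are only one kind of consistency. The attribute description states that listings in the same zipcode should share the same zipcode-level values, so a second check is possible. The sketch below illustrates it on a toy frame with assumed values (not raw.csv); `inconsistent_zips` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the lab data; values are made up for illustration.
toy = pd.DataFrame({
    "zipcode": [10001, 10001, 10002, 10002],
    "pop2016": [21000, 21000, 35000, 36000],  # zipcode 10002 is deliberately inconsistent
    "Price": [120, 95, 80, 80],
})

# Exact duplicates: rows identical across every column.
n_exact = int(toy.duplicated().sum())

# Zipcode-level consistency: a zipcode-level attribute should take
# exactly one distinct value within each zipcode.
distinct_per_zip = toy.groupby("zipcode")["pop2016"].nunique()
inconsistent_zips = distinct_per_zip[distinct_per_zip > 1].index.tolist()

print(n_exact, inconsistent_zips)
```

The same `groupby(...).nunique()` pattern could be applied to each zipcode-level attribute in turn; whether the real data contains such conflicts is not something this sketch establishes.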

Task 3: Data quality check, write code and answer:
3.1 Do missing values exist in the table? 
3.2 Where are the missing data? 
3.3 How much data is missing?
3.4 Are there any variables often missing together?

You can use missingno library to generate plots to support your claim. 
Summarize your findings for task 3 in this markdown cell and write corresponding code in the next code cell.

1. Do missing values exist in the table? Yes, some attributes in this dataset have missing values. No record is missing all variables.
2. How much data is missing? The following table shows how much data is missing in each attribute.
3. Are there any variables often missing together? Yes, based on the missingno heatmap:
* 1. **(pop2010, pop2016, cost_living_index)** are highly correlated in missing values.
* 2. **(median taxes (with mortgage), median taxes (no mortgage), median house value, median monthly owner costs (with mortgage), monthly owner costs (no mortgage))** are highly correlated in missing values.
* 3. **(median gross rent, median asking price for vacant for-sale home/condo, unemployment (%))** are moderately correlated in missing values.
* 4. **('Beds', 'LocationName', 'NumGuests', 'NumReviews', 'Price')** are highly correlated in missing values.

In [None]:
# TODO for Task 3

if df.isna().values.any():
    # This branch is expected to run for this dataset.
    print("There are missing values in this dataframe")
else:
    print("There are no missing values in this dataframe")
total_miss = df.isnull().sum()
# Share of rows missing per attribute, in percent.
percent_miss = total_miss / len(df) * 100

# sort attributes by missing value ratio
missing_data = pd.DataFrame({'Total missing':total_miss,'% missing':percent_miss})
missing_data.sort_values(by='Total missing',ascending=False)

In [None]:
missingno.bar(df)
missingno.heatmap(df)
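
To read exact numbers instead of colors, the correlation between missing-value indicators can be computed directly with pandas; this is roughly the quantity the heatmap visualizes. The sketch below uses a toy frame with assumed values, not raw.csv; `nullity_corr` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Toy stand-in: pop2010 and pop2016 are missing on the same rows.
toy = pd.DataFrame({
    "pop2010": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "pop2016": [1.5, np.nan, 3.5, np.nan, 5.5, 6.5],
    "Bedrooms": [1.0, 2.0, np.nan, 2.0, 3.0, 1.0],  # missing on a different row
    "Price": [100, 80, 120, 90, 150, 60],           # never missing
})

# Indicator matrix: True where a value is missing.
is_missing = toy.isna()

# Keep only columns that are sometimes, but not always, missing;
# a constant indicator has no defined correlation.
partial = is_missing.loc[:, is_missing.any() & ~is_missing.all()].astype(int)

# Pearson correlation between the missing-value indicators.
nullity_corr = partial.corr()
print(nullity_corr.round(2))
```

A value near 1 means two attributes tend to be missing on the same records; a value near 0 means their missingness is unrelated.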

Task 4: What are the potential mechanisms of the missing values? Briefly explain your methodology and your findings (within this markdown cell), and write corresponding code in the next code cell.

Methodology:

1. Plot the histogram of each attribute that has missing values.
2. Plot the missing value matrix for the dataset.
3. Determine whether there is a specific pattern of missing values in each feature.

In [None]:
# TODO for Task 4
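# Plot a histogram for each numeric attribute that has missing values.
for column in df.columns:
    if df[column].dtype != object and df[column].isna().any():
        # df.hist creates its own figure, so no separate plt.figure() call is needed.
        df.hist(column, figsize=(10, 10))

# Nullity matrix: shows where values are missing across the records.
missingno.matrix(df)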


Findings for each attribute that has missing values (why are the data missing for each attribute, and what is the pattern?)

1. **Bathrooms, Beds, LocationName, NumGuests, NumReviews, pop2010, pop2016, cost_living_index, Number of homes, Density of Abnb (%)** could follow the **MAR (missing at random)** mechanism: their values are missing in contiguous runs over a specific period, after which the values return and are not lost again, and we found no evidence suggesting that they follow the **MNAR (missing not at random)** mechanism.
2. The remaining attributes with missing values (for example **Bedrooms**) appear to follow the **MCAR (missing completely at random)** mechanism: their missing values show no recognizable pattern, so we cannot determine why the values are lost where they are.
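
One way to probe claims like these is to compare an observed attribute between records where another attribute is missing and records where it is observed. The sketch below does this on synthetic data generated for illustration (not raw.csv); `price_gap` is a hypothetical helper, and the missingness rules are assumptions built into the toy data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "Price": rng.uniform(50, 300, size=n),
    "Beds": rng.integers(1, 6, size=n).astype(float),
    "Bedrooms": rng.integers(1, 5, size=n).astype(float),
})

# Assumption for the toy data: Beds is mostly missing for expensive listings,
# so its missingness depends on an observed attribute (Price).
toy.loc[(toy["Price"] > 200) & (rng.random(n) < 0.8), "Beds"] = np.nan
# Assumption for the toy data: Bedrooms is missing independently of everything.
toy.loc[rng.random(n) < 0.2, "Bedrooms"] = np.nan

def price_gap(frame, column):
    """Absolute difference in mean Price between rows where `column`
    is missing and rows where it is observed."""
    missing = frame[column].isna()
    return abs(frame.loc[missing, "Price"].mean() - frame.loc[~missing, "Price"].mean())

mar_gap = price_gap(toy, "Beds")
mcar_gap = price_gap(toy, "Bedrooms")
print(round(mar_gap, 1), round(mcar_gap, 1))
```

A large gap argues against MCAR for that attribute. A small gap is consistent with MCAR but does not prove it, and no comparison of observed values can by itself separate MAR from MNAR.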
 

Task 5: Handling missing values. Briefly explain your methodology below (within this markdown cell), and write corresponding code in the next code cell.

Methodology (fill the missing values in each attribute with a simple imputation technique):

1. Fill missing object attributes with the most frequent value.
2. Fill missing integer attributes with the rounded mean value of each feature.
3. Fill missing float attributes with the mean value of each feature.

In [None]:
# TODO for Task 5
# Object attribute: fill with the most frequent value (mode).
df['LocationName'] = df['LocationName'].fillna(df['LocationName'].mode()[0])
for column in df.columns:
    if df[column].dtype != object and df[column].isna().any():
        mean_value = df[column].mean()
        # A numeric column with nulls is stored as float, so an integer attribute
        # is detected by checking that every non-null value is a whole number.
        if (df[column].dropna() % 1 == 0).all():
            df[column] = df[column].fillna(round(mean_value))
        else:
            df[column] = df[column].fillna(mean_value)
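
The same mode and mean strategies are also available through scikit-learn's `SimpleImputer`, which keeps the fitted fill values so they can be reused on new data. The following is a minimal sketch on a toy frame with assumed values (not raw.csv); `filled`, `cat_imputer`, and `num_imputer` are names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the lab data; values are made up for illustration.
toy = pd.DataFrame({
    "LocationName": ["Boston", np.nan, "Boston", "Miami"],
    "Price": [100.0, np.nan, 80.0, 120.0],
})
filled = toy.copy()

# Object attribute: most frequent value.
cat_imputer = SimpleImputer(strategy="most_frequent")
filled["LocationName"] = cat_imputer.fit_transform(toy[["LocationName"]]).ravel()

# Numeric attribute: mean of the observed values.
num_imputer = SimpleImputer(strategy="mean")
filled["Price"] = num_imputer.fit_transform(toy[["Price"]]).ravel()

print(filled)
```

When a model is evaluated on a held-out test set, fitting the imputer on the training split only and calling `transform` on the test split avoids using test-set statistics for the fill values.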

Task 6: Impact on classification performance. Consider one of the handling methods you proposed above for this dataset and perform a classification task to investigate whether your handling method can improve classification performance.

Train-test split: you can do one train/test split, with 70% of the data for training and the remaining 30% for testing. 
Classifier: you can pick any two traditional binary classifiers (e.g., from sklearn).

In [None]:
# TODO for Task 6
# Convert the Rating label and the LocationName attribute to integer codes.
# A separate encoder per column keeps each mapping recoverable.
df['Rating'] = LabelEncoder().fit_transform(df['Rating'])
df['LocationName'] = LabelEncoder().fit_transform(df['LocationName'])

y = df['Rating']
df = df.drop('Rating', axis=1)

columns = list(df.columns)

# Split the data into train (70%) and test (30%), stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(df, y, stratify=y, train_size=0.7, shuffle=True)

# Fit the scaler on the training split only, then apply it to both splits,
# so that test-set statistics do not leak into training.
min_max_scaler = MinMaxScaler()
X_train = pd.DataFrame(min_max_scaler.fit_transform(X_train), columns=columns, index=X_train.index)
X_test = pd.DataFrame(min_max_scaler.transform(X_test), columns=columns, index=X_test.index)
X_train.head(5)

In [None]:
df.head(5)

### Decision tree classifier

In [None]:
#building the classifier
DT_classifier = DecisionTreeClassifier()
#train the classifier
DT_classifier.fit(X_train,y_train)
#test the classifier
y_predicted = DT_classifier.predict(X_test)
#evaluate the classifier
print(f"The accuracy score is {round(accuracy_score(y_test, y_predicted)*100,2)}%.")
print(classification_report(y_test,y_predicted))

Impact:

We could improve the performance by trying different preprocessing techniques, such as handling outliers or oversampling the dataset; by handling null values differently, for example by deletion (dropping features that have many null values); or by trying other classifiers.
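
Whether a different handling method actually helps can be checked by holding the classifier fixed and varying only the imputation step. The sketch below does this on synthetic data generated for illustration (not raw.csv), using a scikit-learn pipeline with cross-validation; the data, the 10% missing rate, and the name `results` are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary classification data; the label depends on the first two features.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out about 10% of the feature values at random.
X[rng.random(X.shape) < 0.1] = np.nan

results = {}
for strategy in ("mean", "median", "most_frequent"):
    # The imputer sits inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        DecisionTreeClassifier(random_state=0),
    )
    results[strategy] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

for strategy, score in results.items():
    print(f"{strategy}: {score:.3f}")
```

Because only the imputation strategy changes between runs, differences in the scores can be attributed to the handling method, within the noise of cross-validation.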

### Logistic regression classifier

In [None]:
#building the classifier
LG_classifier = LogisticRegression(max_iter=10000)
#train the classifier
LG_classifier.fit(X_train,y_train)
#test the classifier
y_predicted = LG_classifier.predict(X_test)
#evaluate the classifier
print(f"The accuracy score is {round(accuracy_score(y_test, y_predicted)*100,2)}%.")
print(classification_report(y_test,y_predicted))

### Random forest classifier

In [None]:
#building the classifier
RF_classifier = RandomForestClassifier()
#train the classifier
RF_classifier.fit(X_train,y_train)
#test the classifier
y_predicted = RF_classifier.predict(X_test)
#evaluate the classifier
print(f"The accuracy score is {round(accuracy_score(y_test, y_predicted)*100,2)}%.")
print(classification_report(y_test,y_predicted))

Task 7: Report your findings through the above experiments (in this markdown cell)

1. The data consists of 33145 rows and 40 features: 39 input features and one output feature.
2. In the output feature, the number of **No** values is approximately half the number of **Yes** values.
3. **12.83%** of the rows are consistent duplicates (entries duplicated across all features).
4. Some features have missing values; some of them follow the MCAR (missing completely at random) mechanism and the others follow the MAR (missing at random) mechanism.
5. Some of the features are often missing together.
6. Most of the features have a normal distribution.
7. Among the logistic regression, decision tree, and random forest classifiers, the random forest classifier performed best.
8. We could improve the performance by trying different preprocessing techniques, such as handling outliers, or by trying other classifiers.