
<h3>Handling missing values:</h3>
a. Check for missing values in the dataset<br>
b. Decide whether to impute missing values or remove rows/columns with missing values<br>
c. Impute missing values using mean, median, or mode<br>

<h3>Handling duplicates:</h3>
a. Check for duplicates in the dataset<br>
b. Decide whether to remove duplicates or keep them<br>

<h3>Handling outliers:</h3>
a. Check for outliers in the dataset<br>
b. Decide whether to remove outliers or keep them<br>
c. Remove outliers using z-score or IQR method<br>

<h3>Handling categorical variables:</h3>
a. Convert categorical variables to numerical variables using one-hot encoding or label encoding<br>

<h3> Handling Imbalanced Data</h3><br>

<h3> Automated EDA Report</h3>


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

In [2]:
# importing dataset
df=pd.read_csv("new_york/AB_NYC_2019.csv")
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


# Handling missing values:

In [3]:
# Check for missing values
df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [4]:
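# Remove every row that contains at least one missing value.
# Note: this also drops the 10,052 rows where last_review and reviews_per_month are missing.
df.dropna(inplace=True)

In [5]:
# Alternative to dropping: impute reviews_per_month with the column mean.
# After the dropna() above no missing values remain, so this is a no-op here.
mean_value = df['reviews_per_month'].mean()
df['reviews_per_month'] = df['reviews_per_month'].fillna(mean_value)

In [6]:
# Alternative to dropping: forward-fill last_review with the previous row's value.
# This is also a no-op once dropna() has run.
df['last_review'] = df['last_review'].ffill()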
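
One way to choose between dropping and imputing is to look at the share of missing values per column first. The sketch below is only an illustration on a small made-up frame: the name toy and all of its values are assumptions, not rows from the Airbnb data. It drops rows only where a sparsely missing column is empty, then imputes the rest with the median and a forward fill.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values (not taken from AB_NYC_2019.csv)
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.5, np.nan, 1.5, np.nan],
    "last_review": ["2019-01-01", None, "2019-03-01", None],
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Drop rows only where a column with few gaps is missing
cleaned = toy.dropna(subset=["name"]).copy()

# Impute the numeric column with its median
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(
    cleaned["reviews_per_month"].median()
)

# Forward-fill the date column with the previous row's value
cleaned["last_review"] = cleaned["last_review"].ffill()

print(missing_share)
print(cleaned)
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a column selection, avoids chained-assignment warnings in recent pandas versions.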

# Handling duplicates

In [7]:
# Check for duplicates
df.duplicated().sum()

0

In [8]:
# This dataset has no duplicates; if it had any, the code below would remove them.

# Remove duplicates
df.drop_duplicates(inplace=True)

# Handling outliers
## Remove outliers using the z-score method, then the IQR method

<br>
<h3 style="color: red">
    
Z score Method: <br>
    
One common technique is to use the z-score to identify data points that are more than 3 standard deviations away from the mean. We can then handle these outliers by removing them or replacing them with more appropriate values.<br>


    
Interquartile range (IQR) method: <br>
    
Any numerical dataset can be divided into quartiles. The first quartile, Q1, is the value below which 25% of the data points fall, the second quartile is the median, and the third quartile, Q3, is the value below which 75% of the data points fall. The IQR method finds outliers with the following procedure:<br>
a. Find the first quartile, Q1.<br>
b. Find the third quartile, Q3.<br>
c. Calculate the IQR: IQR = Q3 - Q1.<br>
d. Define the normal data range, with lower limit Q1 - 1.5*IQR and upper limit Q3 + 1.5*IQR.<br>
e. Treat any data point outside this range as an outlier and a candidate for removal before further analysis.<br>
Quartiles and the IQR are easiest to visualize with a boxplot, whose whiskers extend at most to Q1 - 1.5*IQR and Q3 + 1.5*IQR. Any point beyond the whiskers is an outlier.
</h3>
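
Both rules can be worked through by hand on a small array. The sketch below uses made-up prices (the values and variable names are illustrative assumptions) and computes the z-scores and the IQR fences directly with NumPy, using the population standard deviation as scipy.stats.zscore does by default.

```python
import numpy as np

# Illustrative prices; 1000 is the obvious outlier
prices = np.array(
    [60, 65, 70, 75, 80, 85, 89, 90, 100, 110,
     120, 130, 149, 150, 160, 175, 180, 200, 225, 1000],
    dtype=float,
)

# Z-score: distance from the mean in units of standard deviation
z = np.abs((prices - prices.mean()) / prices.std())
z_outliers = prices[z > 3]

# IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = prices[(prices < lower) | (prices > upper)]

print("z-score outliers:", z_outliers)
print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers)
```

The z-score rule needs a reasonably large sample: with n points and the population standard deviation, no z-score can exceed sqrt(n - 1), so a threshold of 3 can never trigger on fewer than 11 points.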

In [9]:
# Remove price outliers using the z-score: keep rows with |z| below the threshold
from scipy import stats
z = np.abs(stats.zscore(df['price']))
threshold = 3
df = df[z < threshold]

In [10]:
# Remove outliers using IQR method
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
df=df[~((df['price'] < (Q1 - 1.5 * IQR)) | (df['price'] > (Q3 + 1.5 * IQR)))]
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,2019-06-22,0.59,1,129


## Next, we use the StandardScaler class from sklearn.preprocessing to standardize the data and the IsolationForest class from sklearn.ensemble to identify the outliers. We set the contamination parameter to 0.01, which means that we expect around 1% of the data to be outliers.

In [13]:
# Define the features to be used in the model
features = ['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews']

# Filter the data to only include the selected features
data = df[features]

# Handling outliers with Isolation Forest
# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Fit the model
model = IsolationForest(contamination=0.01)
model.fit(data_scaled)

# Identify the outliers
outliers = model.predict(data_scaled)
outliers = np.where(outliers == -1)[0]

## After finding the outliers, we can remove them with the DataFrame drop function. Note that the outliers array holds row positions, not index labels.
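
Because np.where returns row positions while drop expects index labels, the positions have to be mapped through the index first. This matters here because earlier filtering leaves gaps in the index. The sketch below shows the idea on a toy frame; the frame, the predictions array, and the variable names are hypothetical stand-ins for df and the Isolation Forest output.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after earlier row filtering
toy = pd.DataFrame({"price": [149, 225, 89, 80, 9999]}, index=[0, 1, 3, 4, 5])

# Hypothetical model output: -1 marks an outlier, 1 marks an inlier
predictions = np.array([1, 1, 1, 1, -1])
outlier_positions = np.where(predictions == -1)[0]

# Translate row positions into index labels before dropping
toy_clean = toy.drop(toy.index[outlier_positions])

# Equivalent form using a boolean mask on the predictions
toy_clean_mask = toy[predictions == 1]

print(toy_clean)
```

Passing the positions straight to drop would remove the wrong rows, or raise a KeyError, whenever the index is not a clean 0..n-1 range.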

# Handling categorical variables

In [15]:
# Convert categorical variables using one-hot encoding
# converting categorical variable for room_type
df_encoded = pd.get_dummies(df, columns=['room_type'])
df_encoded

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,room_type_Entire home/apt,room_type_Private room,room_type_Shared room
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,149,1,9,2018-10-19,0.21,6,365,0,1,0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,225,1,45,2019-05-21,0.38,2,355,1,0,0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,89,1,270,2019-07-05,4.64,1,194,1,0,0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,80,10,9,2018-11-19,0.10,1,0,1,0,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.97500,200,3,74,2019-06-22,0.59,1,129,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48782,36425863,Lovely Privet Bedroom with Privet Restroom,83554966,Rusaa,Manhattan,Upper East Side,40.78099,-73.95366,129,1,1,2019-07-07,1.00,1,147,0,1,0
48790,36427429,No.2 with queen size bed,257683179,H Ai,Queens,Flushing,40.75104,-73.81459,45,1,1,2019-07-07,1.00,6,339,0,1,0
48799,36438336,Seas The Moment,211644523,Ben,Staten Island,Great Kills,40.54179,-74.14275,235,1,1,2019-07-07,1.00,1,87,0,1,0
48805,36442252,1B-1B apartment near by Metro,273841667,Blaine,Bronx,Mott Haven,40.80787,-73.92400,100,1,2,2019-07-07,2.00,1,40,1,0,0


In [16]:
# Convert categorical variables using label encoding
# applying for neighbourhood_group
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['neighbourhood_group'] = le.fit_transform(df['neighbourhood_group'])

In [17]:
df["neighbourhood_group"]

0        1
1        2
3        1
4        2
5        2
        ..
48782    2
48790    3
48799    4
48805    0
48852    1
Name: neighbourhood_group, Length: 36685, dtype: int64

# Handling Imbalanced dataset

<h3 style="color: green">In machine learning, RandomOverSampler is a technique used to address class imbalance in datasets. Class imbalance occurs when one class in a classification problem has significantly fewer samples than the others. This can lead to poor performance of machine learning models, particularly those that are sensitive to class distribution, such as decision trees and logistic regression.

RandomOverSampler oversamples the minority class by randomly duplicating existing samples from that class until the minority class is as large as the majority class. This technique can be useful when the minority class is too small to be representative of the population, or when the class distribution needs to be balanced to avoid bias in the model's predictions.</h3>
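
The mechanism can be sketched without imblearn. The block below is a pandas-only illustration of random oversampling, not the RandomOverSampler implementation itself. The toy frame, its values, and the choice of room_type as a categorical target are assumptions made for the example.

```python
import pandas as pd

# Toy training data with an imbalanced categorical target (illustrative values)
train = pd.DataFrame({
    "price": [149, 225, 89, 80, 200, 60],
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 2,
})

# Size of the largest class
target_size = train["room_type"].value_counts().max()

parts = []
for _, group in train.groupby("room_type"):
    # Keep every original row of the class
    parts.append(group)
    # Add random duplicates until the class reaches the majority size
    n_extra = target_size - len(group)
    if n_extra > 0:
        parts.append(group.sample(n=n_extra, replace=True, random_state=42))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["room_type"].value_counts())
```

Oversampling is applied to the training split only, so that duplicated rows never leak into the test set.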

In [14]:
# Separate the target variable from the features.
# Note: price is numeric, so RandomOverSampler treats every distinct price as its own class.
# This is for demonstration only; oversampling is meant for categorical targets.
X = df.drop(['price'], axis=1)
y = df['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the RandomOverSampler to oversample the minority classes
oversampler = RandomOverSampler(random_state=42)

# Resample the training data only, so the test set stays untouched
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)

# Print the number of samples in each class before and after oversampling
print("Before oversampling:")
print(y_train.value_counts())
print("\nAfter oversampling:")
print(y_train_resampled.value_counts())

Before oversampling:
150    1290
100    1190
50      940
60      914
75      868
       ... 
306       1
284       1
287       1
293       1
253       1
Name: price, Length: 290, dtype: int64

After oversampling:
90     1290
114    1290
232    1290
188    1290
103    1290
       ... 
98     1290
137    1290
155    1290
68     1290
253    1290
Name: price, Length: 290, dtype: int64


<h4 style="color: green">In the code above, we first separate the target variable from the features. Then, we split the dataset into training and testing sets using the train_test_split function from the sklearn library. Next, we instantiate the RandomOverSampler with a random_state of 42 (for reproducibility) and resample the training data using the fit_resample method. Finally, we print the number of samples in each class before and after oversampling.</h4>

<h4 style="color: green">
Data report<br>

The create_report call below computes the following for the dataset:<br>
a. Essentials: type, unique values, and missing values.<br>
b. Quantile statistics: minimum, median, maximum, range, and interquartile range.<br>
c. Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, and skewness.<br>
d. Most frequent values.<br>
e. Histogram.<br>
f. Correlations: Spearman and Pearson matrices, which show correlated variables and help in understanding the relationships between attributes.<br>
g. Missing values: a summary of the missing values in the data.<br>
</h4>



In [None]:
from dataprep.eda import create_report
report = create_report(df)
report