# AIDI 1002: Machine Learning Programming — Assignment - 1

1. Consider the dataset ‘noisy_data.csv’ and apply the following pre-processing techniques to obtain a clean dataset.

 - Handle missing values by imputation
 - Apply normality tests to the numerical columns, state the hypotheses clearly, and comment on the normality of the data
 - Apply encodings to the categorical variables and scale the features

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("noisy_data.csv")
df

Unnamed: 0,Region,Age,Income,Online Shopper
0,India,49.0,86400.0,No
1,Brazil,32.0,57600.0,Yes
2,USA,35.0,64800.0,No
3,Brazil,43.0,73200.0,No
4,USA,45.0,,Yes
5,India,40.0,69600.0,Yes
6,Brazil,,62400.0,No
7,India,53.0,94800.0,Yes
8,USA,55.0,99600.0,No
9,India,42.0,80400.0,Yes


In [3]:
features = df.iloc[:,:-1].values
target = df.iloc[:,-1].values

## Handling Missing Values By Imputation

In [4]:
from sklearn.impute import SimpleImputer

imputa = SimpleImputer(missing_values = np.nan, strategy = 'mean')

imputa.fit(features[:, 1:3])
features[:, 1:3] = imputa.transform(features[:, 1:3])

features

array([['India', 49.0, 86400.0],
       ['Brazil', 32.0, 57600.0],
       ['USA', 35.0, 64800.0],
       ['Brazil', 43.0, 73200.0],
       ['USA', 45.0, 76533.33333333333],
       ['India', 40.0, 69600.0],
       ['Brazil', 43.77777777777778, 62400.0],
       ['India', 53.0, 94800.0],
       ['USA', 55.0, 99600.0],
       ['India', 42.0, 80400.0]], dtype=object)
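As a cross-check on the imputed values above, mean imputation can be reproduced with pandas alone. This is a minimal sketch that retypes the Age and Income columns from the table above; the names `toy` and `imputed` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Age and Income columns retyped from the noisy_data.csv preview above.
toy = pd.DataFrame({
    "Age": [49, 32, 35, 43, 45, 40, np.nan, 53, 55, 42],
    "Income": [86400, 57600, 64800, 73200, np.nan,
               69600, 62400, 94800, 99600, 80400],
})

# Filling each column with its own mean mirrors SimpleImputer(strategy='mean').
imputed = toy.fillna(toy.mean())

print(imputed.loc[6, "Age"])     # imputed Age for row 6
print(imputed.loc[4, "Income"])  # imputed Income for row 4
```

The two printed values should agree with rows 6 and 4 of the array above, since both approaches replace a missing entry with the mean of the observed entries in its column.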

In [5]:
print('1',features[:,-1])
print('2',features[:,-2])

1 [86400.0 57600.0 64800.0 73200.0 76533.33333333333 69600.0 62400.0 94800.0
 99600.0 80400.0]
2 [49.0 32.0 35.0 43.0 45.0 40.0 43.77777777777778 53.0 55.0 42.0]


## Normality Test

In [6]:
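from scipy.stats import shapiro

# Shapiro-Wilk test, applied to each numerical column.
# H0: the sample is drawn from a normal distribution.
# H1: the sample is not drawn from a normal distribution.
# Decision rule (alpha = 0.05): reject H0 when p <= 0.05.
feature_1 = features[:, -1].astype(float)  # Income
feature_2 = features[:, -2].astype(float)  # Age

stat1, p_value1 = shapiro(feature_1)
stat2, p_value2 = shapiro(feature_2)

if p_value1 > 0.05:
    print("Variable 1 is normally distributed.")
else:
    print("Variable 1 is not normally distributed.")

if p_value2 > 0.05:
    print("Variable 2 is normally distributed.")
else:
    print("Variable 2 is not normally distributed.")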


Variable 1 is normally distributed.
Variable 2 is normally distributed.
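The Shapiro-Wilk test takes as its null hypothesis (H0) that the sample comes from a normal distribution; the alternative (H1) is that it does not. A p-value above 0.05 therefore means that H0 is not rejected, which is weaker than proving normality, especially with only 10 rows. The sketch below makes the decision explicit; `shapiro_report` is a hypothetical helper, and the Age values are copied from the imputed output above.

```python
import numpy as np
from scipy.stats import shapiro

def shapiro_report(name, values, alpha=0.05):
    """Run Shapiro-Wilk and report the decision about H0 (normality)."""
    values = np.asarray(values, dtype=float)
    stat, p_value = shapiro(values)
    reject_h0 = bool(p_value <= alpha)
    verdict = "reject H0" if reject_h0 else "fail to reject H0"
    print(f"{name}: W={stat:.4f}, p={p_value:.4f} -> {verdict}")
    return stat, p_value, reject_h0

# Age column after mean imputation, copied from the output above.
age = [49.0, 32.0, 35.0, 43.0, 45.0, 40.0, 43.77777777777778, 53.0, 55.0, 42.0]
stat, p_value, reject_h0 = shapiro_report("Age", age)
```

Printing the statistic and p-value alongside the verdict makes it easier to comment on how close each column is to the 0.05 threshold.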


## Encoding Categorical Variables

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder= 'passthrough')
features = np.array(ct.fit_transform(features))
features

array([[0.0, 1.0, 0.0, 49.0, 86400.0],
       [1.0, 0.0, 0.0, 32.0, 57600.0],
       [0.0, 0.0, 1.0, 35.0, 64800.0],
       [1.0, 0.0, 0.0, 43.0, 73200.0],
       [0.0, 0.0, 1.0, 45.0, 76533.33333333333],
       [0.0, 1.0, 0.0, 40.0, 69600.0],
       [1.0, 0.0, 0.0, 43.77777777777778, 62400.0],
       [0.0, 1.0, 0.0, 53.0, 94800.0],
       [0.0, 0.0, 1.0, 55.0, 99600.0],
       [0.0, 1.0, 0.0, 42.0, 80400.0]], dtype=object)

In [8]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
target = LE.fit_transform(target)
target

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
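Both encoders order their categories alphabetically, which explains why India maps to the second one-hot column and why 'No' becomes 0. A small sketch on a few rows of the data shows how to read that mapping back from the fitted encoders; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A few Region / Online Shopper values taken from the table above.
regions = np.array([["India"], ["Brazil"], ["USA"], ["Brazil"]])
shopper = np.array(["No", "Yes", "No", "No"])

onehot = OneHotEncoder()
# The default output is sparse; toarray() gives a dense matrix for inspection.
region_matrix = onehot.fit_transform(regions).toarray()

label = LabelEncoder()
shopper_codes = label.fit_transform(shopper)

print(onehot.categories_[0])  # column order of the one-hot matrix
print(label.classes_)         # index in this array is the encoded integer
```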

## Feature Scaling

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
scaled_features

array([[-0.65465367,  1.22474487, -0.65465367,  0.75887436,  0.74947325],
       [ 1.52752523, -0.81649658, -0.65465367, -1.71150388, -1.43817841],
       [-0.65465367, -0.81649658,  1.52752523, -1.27555478, -0.89126549],
       [ 1.52752523, -0.81649658, -0.65465367, -0.11302384, -0.25320042],
       [-0.65465367, -0.81649658,  1.52752523,  0.17760889,  0.        ],
       [-0.65465367,  1.22474487, -0.65465367, -0.54897294, -0.52665688],
       [ 1.52752523, -0.81649658, -0.65465367,  0.        , -1.0735698 ],
       [-0.65465367,  1.22474487, -0.65465367,  1.34013983,  1.38753832],
       [-0.65465367, -0.81649658,  1.52752523,  1.63077256,  1.75214693],
       [-0.65465367,  1.22474487, -0.65465367, -0.25834021,  0.29371249]])
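StandardScaler standardizes each column as z = (x - mean) / std, where std is the population standard deviation (ddof = 0). The sketch below reproduces the scaled Age column by hand with NumPy, using the imputed Age values from earlier in the notebook.

```python
import numpy as np

# Age column after mean imputation, copied from the earlier output.
age = np.array([49.0, 32.0, 35.0, 43.0, 45.0, 40.0,
                43.77777777777778, 53.0, 55.0, 42.0])

# np.std uses ddof=0 by default, matching StandardScaler.
z = (age - age.mean()) / age.std()

print(np.round(z, 8))
```

The result should match the fourth column of the scaled array above. The imputed row scores 0 because it sits exactly at the column mean.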

2. Consider the text in the file ‘wiki.txt’ and answer the following questions:

 - Write a program to convert the text into tokens with two tokenization methods, ‘RegexpTokenizer()’ and
   ‘word_tokenize()’, from the NLTK library. (Note: The tokens should not contain stop words or punctuation symbols. Feel free to
   decide on the correct list of stop words; e.g., negative words (don’t) could be important for you. Execute both methods
   of tokenization along with your code for removing stop words and punctuation.)
 - Write a regular expression to extract all the year mentions in the ‘wiki.txt’ file.
 - State the differences observed in the output of the tokenization methods.


In [10]:
! pip install nltk



In [11]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aksha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aksha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# Load the dataset from a text file
with open('wiki.txt', 'r') as file:
    data = file.read()
    
# Tokenization using RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens_regexp = tokenizer.tokenize(data)

# Tokenization using word_tokenize
tokens_word = word_tokenize(data)

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens_regexp = [token for token in tokens_regexp if token.lower() not in stop_words]
tokens_word = [token for token in tokens_word if token.lower() not in stop_words]

# Remove punctuation symbols
tokens_regexp = [token for token in tokens_regexp if token not in string.punctuation]
tokens_word = [token for token in tokens_word if token not in string.punctuation]

# Print the tokens
print("Tokens using RegexpTokenizer:")
print(tokens_regexp)

print("\nTokens using word_tokenize:")
print(tokens_word)


Tokens using RegexpTokenizer:
['history', 'NLP', 'generally', 'started', '1950s', 'although', 'work', 'found', 'earlier', 'periods', '1950', 'Alan', 'Turing', 'published', 'article', 'titled', 'Computing', 'Machinery', 'Intelligence', 'proposed', 'called', 'Turing', 'test', 'criterion', 'intelligence', 'Georgetown', 'experiment', '1954', 'involved', 'fully', 'automatic', 'translation', 'sixty', 'Russian', 'sentences', 'English', 'authors', 'claimed', 'within', 'three', 'five', 'years', 'machine', 'translation', 'would', 'solved', 'problem', '2', 'However', 'real', 'progress', 'much', 'slower', 'ALPAC', 'report', '1966', 'found', 'ten', 'year', 'long', 'research', 'failed', 'fulfill', 'expectations', 'funding', 'machine', 'translation', 'dramatically', 'reduced', 'Little', 'research', 'machine', 'translation', 'conducted', 'late', '1980s', 'first', 'statistical', 'machine', 'translation', 'systems', 'developed', 'notably', 'successful', 'NLP', 'systems', 'developed', '1960s', 'SHRDLU', 
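One caveat about the punctuation filter: `token not in string.punctuation` is a substring test against a single string, so multi-character punctuation tokens can pass through it. The sketch below compares it with a character-wise check on a made-up token list; whether ‘wiki.txt’ actually produces such tokens is an assumption, and `is_punctuation` is a hypothetical helper.

```python
import string

# Made-up tokens of the kind a word-level tokenizer can emit.
tokens = ["NLP", ",", "``", "Turing", "''", "...", "test", "."]

# Substring test, as in `token not in string.punctuation`.
substring_filtered = [t for t in tokens if t not in string.punctuation]

def is_punctuation(token):
    """True when every character of the token is a punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Character-wise test: drop a token only if it consists solely of punctuation.
charwise_filtered = [t for t in tokens if not is_punctuation(t)]

print(substring_filtered)  # ['NLP', '``', 'Turing', "''", '...', 'test']
print(charwise_filtered)   # ['NLP', 'Turing', 'test']
```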

In [13]:
import re

with open('wiki.txt', 'r') as file:
    data = file.read()
    
pattern = r'\d{4}'  # raw string; also matches the digits of decades such as "1950s"
dates = re.findall(pattern, data)
dates

['1950', '1950', '1954', '1966', '1980', '1960', '1964', '1966']
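The output contains '1980' and '1960' because `\d{4}` also matches the first four characters of decades such as "1980s". Depending on whether decades should count as year mentions, a word boundary can exclude them or an optional "s" can keep them whole. The sample sentence below is a paraphrase made up for illustration, not the contents of ‘wiki.txt’.

```python
import re

sample = ("NLP started in the 1950s. In 1950, Alan Turing published a paper; "
          "the Georgetown experiment followed in 1954.")

any_four_digits = re.findall(r"\d{4}", sample)       # decades truncated to 4 digits
exact_years = re.findall(r"\b\d{4}\b", sample)       # decades excluded
years_and_decades = re.findall(r"\b\d{4}s?\b", sample)  # decades kept whole

print(any_four_digits)    # ['1950', '1950', '1954']
print(exact_years)        # ['1950', '1954']
print(years_and_decades)  # ['1950s', '1950', '1954']
```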

Observations

1. word_tokenize() emits punctuation marks and special characters as separate tokens, whereas RegexpTokenizer(r'\w+') never produces them, because its pattern matches only runs of word characters.
2. RegexpTokenizer builds its tokens from the supplied regular expression, so the text is split at every non-word character, while word_tokenize() applies NLTK's default word tokenizer.
3. The word tokenizer found 143 tokens in the wiki.txt dataset, compared to 137 tokens found by the regular expression tokenizer.
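The first two observations can be seen on a single sentence. This sketch uses NLTK's `TreebankWordTokenizer` as a stand-in for the word-level step of `word_tokenize()`, because it needs no downloaded models; that substitution is an assumption, and `word_tokenize()` may differ in details. The sentence is made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer

sentence = "NLP didn't start in the 1950s."

# \w+ keeps only runs of word characters, so the apostrophe splits "didn't".
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(sentence)

# Treebank-style rules keep punctuation and split contractions linguistically.
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(regexp_tokens)
print(treebank_tokens)
```

The regular-expression tokenizer loses the negation (it yields "didn" and "t"), while the Treebank-style tokenizer keeps "n't" as a token of its own, which matters if negative words are to be retained when filtering stop words.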

3. Consider this dataset from Kaggle (download it from https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home) and answer the following questions:
 - Apply the following feature selection techniques to the Melbourne housing dataset (20 points):
    ∗ Correlation
    ∗ Chi-Square
    ∗ Mutual-Information
    ∗ Random Forest feature importance
 - Compare the importance of the selected features using a bar chart (10 points).
 - Comment on the results obtained from the various feature selection techniques, and state which is the best and which is the worst feature selection technique on the given dataset (10 points).

In [14]:
import pandas as pd
df = pd.read_csv("melb_data.csv")
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000,S,Biggin,03-12-2016,2.5,3067,...,1,1.0,202,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019
1,Abbotsford,25 Bloomburg St,2,h,1035000,S,Biggin,04-02-2016,2.5,3067,...,1,0.0,156,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019
2,Abbotsford,5 Charles St,3,h,1465000,SP,Biggin,04-03-2017,2.5,3067,...,2,0.0,134,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019
3,Abbotsford,40 Federation La,3,h,850000,PI,Biggin,04-03-2017,2.5,3067,...,2,1.0,94,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019
4,Abbotsford,55a Park St,4,h,1600000,VB,Nelson,04-06-2016,2.5,3067,...,1,2.0,120,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019


In [15]:
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  int64  
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  int64  
 10  Bedroom2       13580 non-null  int64  
 11  Bathroom       13580 non-null  int64  
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  int64  
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [16]:
# Preprocess the dataset
# Remove rows with missing values
df.dropna(inplace=True)

# Encode categorical variables
encoder = LabelEncoder()
df['Suburb'] = encoder.fit_transform(df['Suburb'])
df['Type'] = encoder.fit_transform(df['Type'])
df['Method'] = encoder.fit_transform(df['Method'])
df['Regionname'] = encoder.fit_transform(df['Regionname'])

# Split the dataset into features and target variable
X = df.drop(['Price','Address','SellerG', 'Date','CouncilArea', 'Postcode'], axis=1)
y = df['Price']

scaler = MinMaxScaler()
x = scaler.fit_transform(X)

In [17]:
# Feature selection using correlation

# Exclude non-numeric columns from the correlation matrix
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = df[numeric_columns].corr()

# Select features based on correlation with 'Price'
correlation_features = correlation_matrix.abs()['Price'].nlargest(10).index
df_correlation = df[correlation_features]

# Print the selected features
print("Correlation features:")
print(df_correlation.columns.tolist())

Correlation features:
['Price', 'Rooms', 'BuildingArea', 'Bedroom2', 'Bathroom', 'Type', 'YearBuilt', 'Car', 'Lattitude', 'Longtitude']
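Note that 'Price' heads the list because every column correlates perfectly with itself, so only nine of the ten entries are actual features. A minimal sketch of the same ranking with the target dropped first is shown below, on synthetic data; the toy columns and coefficients are invented for illustration and are not taken from the Melbourne dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Price depends on Rooms and Distance, not on Noise.
rng = np.random.default_rng(0)
n = 200
rooms = rng.integers(1, 6, size=n)
distance = rng.uniform(1, 40, size=n)
toy = pd.DataFrame({
    "Rooms": rooms,
    "Distance": distance,
    "Noise": rng.normal(0, 1, size=n),
    "Price": 300_000 * rooms - 10_000 * distance + rng.normal(0, 50_000, size=n),
})

# Rank features by absolute Pearson correlation with the target,
# dropping the target itself before sorting.
ranking = (
    toy.corr()["Price"]
    .drop("Price")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```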




In [18]:
# Feature selection using chi-square
chi2_selector = SelectKBest(score_func=chi2, k=10)
X_chi2 = chi2_selector.fit_transform(x, y)

print("\nChi-square features:")
print(X.columns[chi2_selector.get_support()].tolist())
print('Scores',chi2_selector.scores_)


Chi-square features:
['Suburb', 'Rooms', 'Type', 'Method', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Regionname', 'Propertycount']
Scores [ 283.2024679   229.65192647 1919.26528772  297.16731238  116.33144932
  113.08855908  360.31473464  107.64960098   66.69154453   70.46453266
    5.02193561   41.03981202   41.61254177  236.09459749  193.38627623]
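The chi-square scorer in scikit-learn expects non-negative features and treats `y` as class labels, so passing raw prices makes nearly every distinct price its own class. One possible alternative, shown here only as a sketch on synthetic data, is to bin the target into quantile bands first. The four-band choice and all names below are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: price depends on Rooms, not on Noise.
rng = np.random.default_rng(0)
n = 400
rooms = rng.integers(1, 6, size=n)
noise = rng.uniform(0, 1, size=n)
price = 300_000 * rooms + rng.normal(0, 50_000, size=n)

X_toy = pd.DataFrame({"Rooms": rooms, "Noise": noise})

# Bin the continuous target into four quantile bands (labels 0..3).
price_band = pd.qcut(price, q=4, labels=False)

# chi2 requires non-negative inputs; min-max scaling guarantees that.
X_scaled = MinMaxScaler().fit_transform(X_toy)
scores, p_values = chi2(X_scaled, price_band)

print(dict(zip(X_toy.columns, scores)))
```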


In [19]:
# Sort the importance scores and feature names in descending order
import matplotlib.pyplot as plt  # required for the plotting calls below

feature_names = X.columns
sorted_indices = np.argsort(chi2_selector.scores_)[::-1]
sorted_importance_scores = chi2_selector.scores_[sorted_indices]
sorted_feature_names = feature_names[sorted_indices]

# Create the figure and axes
fig, ax = plt.subplots()

# Plot the sorted importance scores in a bar chart
ax.bar(sorted_feature_names, sorted_importance_scores)

# Rotate x-axis labels for better readability if needed
plt.xticks(rotation=90)
plt.legend(['Importance Score'])

# Set the x-axis label
ax.set_xlabel('Feature')

In [None]:
# Feature selection using mutual information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_mi = mi_selector.fit_transform(X, y)

print('Scores: ', mi_selector.scores_)
print("\nMutual information features:")
print(X.columns[mi_selector.get_support()].tolist())

In [None]:
# Sort the importance scores and feature names in descending order

sorted_indices = np.argsort(mi_selector.scores_)[::-1]
sorted_importance_scores = mi_selector.scores_[sorted_indices]
sorted_feature_names = feature_names[sorted_indices]
 
# Create the figure and axes
fig, ax = plt.subplots()

# Plot the sorted importance scores in a bar chart
ax.bar(sorted_feature_names, sorted_importance_scores)

# Rotate x-axis labels for better readability if needed
plt.xticks(rotation=90)

plt.legend(['Importance Score'])

# Set the x-axis label
ax.set_xlabel('Feature')

In [None]:
clf = RandomForestClassifier(n_estimators=50)

model = clf.fit(X,y)
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)

print("\nRandom forest feature importance features:\n")
print(feat_importances)
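Because Price is a continuous quantity, the regression counterparts of these scorers, `mutual_info_regression` and `RandomForestRegressor`, are a natural alternative to the classification variants used above. The sketch below runs both on synthetic data; it illustrates the API only and makes no claim about how the Melbourne results would change.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: feature 0 has a strong linear effect, feature 1 a weak
# non-linear effect, and feature 2 is pure noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Mutual information for a continuous target.
mi_scores = mutual_info_regression(X_toy, y_toy, random_state=0)

# Impurity-based importances from a regression forest.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

print("Mutual information:", mi_scores)
print("Forest importances:", forest.feature_importances_)
```

Fixing `random_state` makes both rankings reproducible between runs, which helps when comparing techniques.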

In [None]:
import matplotlib.pyplot as plt
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
top_features = feat_importances.head(10)
top_features.plot(kind='bar')

Here, the bar graph shows the 10 features that the random forest classifier ranks as most important for the target variable. The x-axis lists the features and the y-axis shows their importance scores. Lattitude, Longtitude, BuildingArea, and Landsize have the highest importance scores, while Method, Car, Suburb, and Propertycount rank lowest among the top 10 features. In addition, Bedroom2, Rooms, Bathroom, Regionname, and Type have lower importance scores than the others.
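To compare the techniques in a single chart, the scores first need a common scale, since chi-square statistics, correlations, and forest importances live in very different ranges. One option, sketched below, divides each technique's scores by its own maximum and draws grouped bars. The numbers are placeholders for illustration only and are NOT results from the Melbourne data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scores for illustration only; not results from the Melbourne data.
scores = pd.DataFrame(
    {
        "Correlation": [0.5, 0.4, 0.1],
        "Chi-square": [230.0, 1900.0, 110.0],
        "Mutual information": [0.2, 0.3, 0.6],
        "Random forest": [0.05, 0.04, 0.12],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Scale each technique to [0, 1] by dividing by its largest score.
normalized = scores / scores.max()

# One group of bars per feature, one bar per technique.
ax = normalized.plot(kind="bar", figsize=(8, 4))
ax.set_xlabel("Feature")
ax.set_ylabel("Score relative to the technique's maximum")
ax.legend(title="Technique")
plt.tight_layout()
```

In a notebook, the real score vectors (for example `chi2_selector.scores_` and `model.feature_importances_`, indexed by `X.columns`) would replace the placeholder columns, and the default inline backend would display the figure.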

Conclusion

In conclusion, given the structure of this dataset and the techniques applied, random forest feature importance and mutual information tend to be the more dependable and flexible methods, because they can capture many kinds of relationships between the features and the target variable. The correlation matrix captures only linear relationships, and the chi-square test is suitable only for categorical variables.