# Data Preprocessing

The goal of this lab is to introduce you to data preprocessing techniques in order to make your data suitable for applying a learning algorithm.

## 1. Handling Missing Values

A common (and very unfortunate) property of real data sets is the occurrence of missing and erroneous values in multiple features. For this exercise we will use a data set about abalone snails.
The data set is contained in the Zip file you downloaded from Moodle (abalone.csv).

To determine the age of an abalone snail you have to kill the snail and count the annual
rings. You are told to estimate the age of a snail on the basis of the following attributes:
1. type: male (0), female (1) and infant (2)
2. length in mm
3. width in mm
4. height in mm
5. total weight in grams
6. weight of the meat in grams
7. drained weight in grams
8. weight of the shell in grams
9. number of annual rings (number of rings + 1.5 yields the age)

However, the data is incomplete. Missing values are marked with -1.

In [17]:
import pandas as pd
# Load the data. If the download does not work, use the CSV that was part of the zip file.
df = pd.read_csv("http://www.cs.uni-potsdam.de/ml/teaching/ss15/ida/uebung02/abalone.csv")
df.columns=['type','length','width','height','total_weight','meat_weight','drained_weight','shell_weight','num_rings']
df.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,-1
1,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,2,-1.0,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,2,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


In [18]:
import numpy as np

# df_nan = df.drop(columns='type')
df_nan = df.replace(-1, np.nan)  # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
df_nan.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0.0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,
1,1.0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0
2,0.0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0
3,2.0,,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0
4,2.0,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8.0
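Before choosing a strategy for the missing values, it helps to know how many there are per column. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration and only mimics the `-1` sentinel used in the abalone file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using the same -1 sentinel as the abalone data
toy = pd.DataFrame({
    "length": [0.35, -1.0, 0.44, 0.425],
    "width": [0.265, 0.255, -1.0, 0.3],
    "num_rings": [-1, 7, 10, 8],
})

# Replace the sentinel with NaN so pandas treats it as missing
toy_nan = toy.replace(-1, np.nan)

# Number and share of missing entries per column
missing_count = toy_nan.isna().sum()
missing_share = toy_nan.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Replacing a sentinel across the whole frame is only safe when `-1` cannot be a legitimate value in any column, which is plausible here because all measurements are non-negative.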


### Exercise 1.1

Compute the mean of each numeric column and the counts of each categorical column, excluding the missing values.

In [19]:
df_drop = df_nan.dropna(axis=0)
df_drop.describe()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
count,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0
mean,0.955671,0.523445,0.407424,0.139467,0.826354,0.358718,0.179689,0.23833,9.925099
std,0.833652,0.120243,0.099358,0.042613,0.487932,0.220866,0.108907,0.139079,3.227777
min,0.0,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.0,0.45,0.35,0.115,0.4405,0.186,0.0925,0.13,8.0
50%,1.0,0.545,0.425,0.14,0.8025,0.3385,0.171,0.235,9.0
75%,2.0,0.615,0.48,0.165,1.145,0.5005,0.25175,0.325,11.0
max,2.0,0.815,0.65,1.13,2.8255,1.351,0.76,1.005,29.0
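Note that `dropna(axis=0)` removes every row that has at least one missing entry, so the statistics above are computed on complete rows only. An alternative is to compute each statistic per column and skip only the missing entries of that column. The sketch below illustrates the difference on a hypothetical toy frame (the `toy` data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing entries in different rows
toy = pd.DataFrame({
    "type": [0, 1, np.nan, 2, 2],
    "length": [0.35, 0.53, 0.44, np.nan, 0.425],
    "num_rings": [np.nan, 9, 10, 7, 8],
})

# Row-wise deletion: any row with a missing entry is dropped
rows_kept = len(toy.dropna(axis=0))

# Column-wise means: mean() skips NaN per column by default (skipna=True)
numeric_means = toy[["length", "num_rings"]].mean()

# Counts per category: value_counts() ignores NaN by default (dropna=True)
type_counts = toy["type"].value_counts()

print(rows_kept)
print(numeric_means)
print(type_counts)
```

In this toy example only two of five rows survive row-wise deletion, while each column-wise statistic still uses four values.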


### Exercise 1.2

Compute the median of each numeric column, excluding the missing values.

In [20]:
df_drop.iloc[:,1:].median()

length            0.5450
width             0.4250
height            0.1400
total_weight      0.8025
meat_weight       0.3385
drained_weight    0.1710
shell_weight      0.2350
num_rings         9.0000
dtype: float64

### Exercise 1.3

Handle the missing values in a way that you find suitable. Think about different ways. Discuss dis-/advantages of your approach. Argue your choices.


In [21]:
df_mean = df_nan.fillna(df_nan.mean())
df_mean.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0.0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,9.921756
1,1.0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0
2,0.0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0
3,2.0,0.523692,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0
4,2.0,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8.0
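Mean imputation is only one option. If `type` has missing entries, filling them with the mean of the category codes would produce a value that is not a valid category. A possible alternative, sketched here on a hypothetical toy frame, is to impute numeric columns with the median (less sensitive to outliers) and categorical columns with the most frequent category:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one categorical and one numeric column
toy = pd.DataFrame({
    "type": [0, 1, np.nan, 2, 2],
    "length": [0.35, 0.53, 0.44, np.nan, 0.425],
})

imputed = toy.copy()

# Numeric column: fill with the median of the observed values
imputed["length"] = imputed["length"].fillna(imputed["length"].median())

# Categorical column: fill with the most frequent category (mode)
imputed["type"] = imputed["type"].fillna(imputed["type"].mode().iloc[0])

print(imputed)
```

Either kind of imputation keeps all rows but shrinks the variance of the imputed column, whereas dropping rows keeps the observed distribution at the cost of losing data.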


### Exercise 1.4

Perform Z-score normalization on every column (except the type of course!)

In [22]:
numeric_columns = list(df_mean.columns)
numeric_columns.remove('type')

# with pandas
df_z = df_mean.copy()
df_z[numeric_columns] = (df_z[numeric_columns] - df_z[numeric_columns].mean()) / df_z[numeric_columns].std() # .iloc[:,1:] excluding type
df_z.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0.0,-1.465927,-1.461399,-1.207063,-1.248081,-1.184826,-1.221122,-1.226923,-5.570299e-16
1,1.0,0.053238,0.12313,-0.112168,-0.314104,-0.46872,-0.359144,-0.208153,-0.2890442
2,0.0,-0.706344,-0.439122,-0.355478,-0.64715,-0.655729,-0.61403,-0.608384,0.02453569
3,2.0,0.0,-1.563627,-1.450374,-1.290487,-1.230438,-1.304539,-1.336077,-0.9162041
4,2.0,-0.832941,-1.103602,-1.085408,-0.987436,-0.995537,-0.952333,-0.863076,-0.6026242


In [24]:
from sklearn.preprocessing import StandardScaler
df_normalised = df_mean.copy()  # copy() must be called; a bare `copy` is only the bound method

scaler = StandardScaler()
# StandardScaler divides by the population standard deviation (ddof=0)
df_normalised[numeric_columns] = scaler.fit_transform(df_normalised[numeric_columns])
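The pandas version of the Z-score divides by the sample standard deviation (`std()` uses `ddof=1` by default), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The two results therefore differ by a constant factor. A small sketch with made-up values shows the size of that factor:

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
values = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0])

# pandas default: sample standard deviation (ddof=1)
z_sample = (values - values.mean()) / values.std()

# Population standard deviation (ddof=0), the convention StandardScaler uses
z_population = (values - values.mean()) / values.std(ddof=0)

# The ratio is sqrt(n / (n - 1)) for every element that is not zero
ratio = z_population.iloc[0] / z_sample.iloc[0]
print(ratio)
```

For a data set with several thousand rows the factor is very close to 1, so the choice rarely matters in practice, but it explains small mismatches when comparing the two approaches.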

## 2. Preprocessing text (Optional)

One possible way to transform text documents into vectors of numeric attributes is to use the TF-IDF representation. We will experiment with this representation using the 20 Newsgroups data set. The data set contains postings on 20 different topics. The classification problem is to decide which of the topics a posting falls into. Here, we will only consider postings about medicine and space.

In [25]:
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.med', 'sci.space']
raw_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print(f'The index of each category is: {[(i,target) for i,target in enumerate(raw_data.target_names)]}')

The index of each category is: [(0, 'sci.med'), (1, 'sci.space')]


Check out some of the postings; you might find some funny ones!

In [26]:
import numpy as np

idx = np.random.randint(0, len(raw_data.data))
print (f'This is a {raw_data.target_names[raw_data.target[idx]]} email.\n')
print (f'There are {len(raw_data.data)} emails.\n')
print(raw_data.data[idx])

This is a sci.med email.

There are 1187 emails.

From: cfaks@ux1.cts.eiu.edu (Alice Sanders)
Subject: Frozen shoulder and lawn mowing
Organization: Eastern Illinois University
Lines: 12

Ihave had a frozen shoulder for over a year or about a year.  It is still
partially frozen, and I am still in physical therapy every week.  But the
pain has subsided almost completely.  UNTIL last week when I mowed the
lawn for twenty minutes each, two days in a row.  I have a push type power
mower.  The pain started back up a little bit for the first time in quite
a while, and I used ice and medicine again.  Can anybody explain why this
particular activity, which does not seem to stress me very much generally,
should cause this shoulder problem?

Thanks.

Alice



Let's pick the first 10 postings from each category.

In [27]:
idxs_med = np.flatnonzero(raw_data.target == 0)
idxs_space = np.flatnonzero(raw_data.target == 1)
idxs = np.concatenate([idxs_med[:10],idxs_space[:10]])
data = np.array(raw_data.data)
data = data[idxs]

<a href="http://www.nltk.org/">NLTK</a> is a toolkit for natural language processing. Take some time to install it and go through this <a href="http://www.slideshare.net/japerk/nltk-in-20-minutes">short tutorial/presentation</a>. (Alternatively, use an environment such as Google Colab, where the package is already installed.)

The package downloaded below (`punkt`) is a tokenizer that divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

In [28]:
import nltk
import itertools
nltk.download('punkt')

# Tokenize each posting into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in data]
vocabulary_size = 1000
unknown_token = 'unknown'

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/melissathephasdin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [29]:
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print (f"Found {len(word_freq.items())} unique words tokens.")

Found 1636 unique words tokens.


In [30]:
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
 
print (f"Using vocabulary size {vocabulary_size}." )
print (f"The least frequent word in our vocabulary is '{vocab[-1][0]}' and appeared {vocab[-1][1]} times.")

Using vocabulary size 1000.
The least frequent word in our vocabulary is 'AN' and appeared 1 times.
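The purpose of `word_to_index` and the `unknown` token is to map every token to an integer, with all out-of-vocabulary tokens sharing one index. The following sketch uses only the standard library and a hypothetical, already tokenized toy corpus (`tokenized_docs` and the small `vocabulary_size` are made up for illustration):

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized
tokenized_docs = [
    ["the", "shuttle", "launch", "shuttle"],
    ["the", "shoulder", "pain", "the"],
]
vocabulary_size = 3
unknown_token = "unknown"

# Keep the (vocabulary_size - 1) most frequent tokens, plus the unknown token
word_freq = Counter(tok for doc in tokenized_docs for tok in doc)
index_to_word = [w for w, _ in word_freq.most_common(vocabulary_size - 1)]
index_to_word.append(unknown_token)
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Encode each document; tokens outside the vocabulary map to the unknown index
unknown_index = word_to_index[unknown_token]
encoded = [
    [word_to_index.get(tok, unknown_index) for tok in doc]
    for doc in tokenized_docs
]
print(encoded)
```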


In [31]:
data

array(['From: geb@cs.pitt.edu (Gordon Banks)\nSubject: Re: Striato Nigral Degeneration\nReply-To: geb@cs.pitt.edu (Gordon Banks)\nOrganization: Univ. of Pittsburgh Computer Science\nLines: 16\n\nIn article <9303252134.AA09923@walrus.mvhs.edu> ktodd@walrus.mvhs.edu ((Ken Todd)) writes:\n>I would like any information available on this rare disease.  I understand\n>that an operation referred to as POLLIDOTOMY may be in order.  Does anyone\n>know of a physician that performs this procedure.  All responses will be\n>appreciated.  Please respond via email to ktodd@walrus.mvhs.edu\n\nIt isn\'t that rare, actually.  Many cases that are called Parkinson\'s\nDisease turn out on autopsy to be SND.  It should be suspected in any\ncase of Parkinsonism without tremor and which does not respond to\nL-dopa therapy.  I don\'t believe pallidotomy will do much for SND.\n\n-- \n----------------------------------------------------------------------------\nGordon Banks  N3JXP      | "Skepticism is the chast

### Exercise 2.1

Code your own TF-IDF representation function and use it on this data set. (Don't use code from libraries; build your own function with NumPy/pandas.) Use the formula TFIDF = TF * (IDF + 1). The effect of adding 1 to the IDF is that terms with zero IDF, i.e., terms that occur in all documents of the training set, are not entirely ignored. The term frequency (TF) is the raw count of a term in a document. The inverse document frequency (IDF) is the natural logarithm of the inverse of the fraction of documents that contain the term.
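To make the formula concrete, here is a small worked example in NumPy on a made-up count matrix (four documents, three terms). It is only a sketch of the arithmetic under the definitions above, not a full solution for the newsgroup data:

```python
import numpy as np

# Made-up raw counts: rows are documents, columns are terms
counts = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [3, 0, 1],
    [1, 2, 0],
])

n_docs = counts.shape[0]

# Number of documents that contain each term: [4, 2, 1]
doc_freq = (counts > 0).sum(axis=0)

# IDF = ln(N / document frequency): [0, ln 2, ln 4]
idf = np.log(n_docs / doc_freq)

# TFIDF = TF * (IDF + 1), with TF being the raw count
tfidf = counts * (idf + 1)
print(tfidf)
```

The first term occurs in every document, so its IDF is 0, yet its TF-IDF values equal the raw counts instead of vanishing. That is the effect of the added 1.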

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
df = pd.DataFrame(countvec.fit_transform(data).toarray(), columns=countvec.get_feature_names_out())
df

Unnamed: 0,02,041300,07,0815,10,101,10511,11,115397,12,...,yellow,yeltsin,yet,you,young,younger,your,z3,zeta,zeus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,5,0,0,2,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,3,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
import warnings
warnings.filterwarnings('ignore')

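### Version 1: normalizing TF by document length (does not work)
This version divides each count by the total number of terms in the document, which does not match the exercise's definition of TF as the raw count.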

In [47]:
def tfidf(df):
    # term frequency, normalized by the total number of terms in each document
    term_count = df.sum(axis=1)
    tf = df.div(term_count, axis=0)
    
    # number of documents that contain term
    doc_num_contains = df[df > 0].count()
    
    # total number of documents in data set
    doc_count = df.shape[0]
    
    # idf following formula from class 
    idf = np.log(doc_count/doc_num_contains)
          
    tfidf = tf * (idf + 1)
    
    return(tfidf)

### Version 2: using the raw counts in df as TF (works)
`CountVectorizer.fit_transform` already fills `df` with raw term counts, which is the TF defined in the exercise, so a separate TF calculation is not necessary.

In [49]:
def tfidf(df):
    
    # number of documents that contain term
    doc_num_contains = df[df > 0].count()
    
    # total number of documents in data set
    doc_count = df.shape[0]
    
    # idf following formula from class 
    idf = np.log(doc_count/doc_num_contains)
          
    tfidf = df * (idf + 1)
    
    return(tfidf)

In [50]:
# tfidf(df)

In [51]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
df = pd.DataFrame(countvec.fit_transform(data).toarray(), columns=countvec.get_feature_names_out())
# row = instance
    
rep = tfidf(df)

# Check if your implementation is correct
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False, use_idf=True)
X_train = pd.DataFrame(vectorizer.fit_transform(data).toarray(), columns=vectorizer.get_feature_names_out())
answer=['No','Yes']
epsilon = 0.0001
if rep is None: 
  print (f'Is this implementation correct?\nAnswer: {answer[0]}')
if rep is not None:
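  # compare absolute differences; a signed difference would also accept values that are far too large
  print (f'Is this implementation correct?\nAnswer: {answer[1*np.all(np.abs(X_train - rep) < epsilon)]}')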

Is this implementation correct?
Answer: Yes


In [192]:
#def tfidf(df):
#    
#    tf = df.sum()
#    
#    denomIdf = df[df > 0].count()
#    
#    totDocs = df.shape[0]
#    
#    idf = np.log(totDocs/denomIdf)
#    
#    test = df.where(df < 1, 1)
#        
#    return(df * (test * (idf +1)))
#    
#tfidf(df)

In [52]:
# an example of what to do with these similarities:


# analysis with tf-idf
from sklearn.metrics.pairwise import cosine_similarity

similiarities = cosine_similarity(rep, rep) # measure of the similarity of the direction of two vectors

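Cosine similarity is the cosine of the angle between two vectors.
Collinear vectors that point in the same direction have a cosine similarity of 1; orthogonal vectors have a cosine similarity of 0.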
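The relationship between direction and cosine similarity can be checked on a few hand-picked vectors. The helper `cosine` below is a hypothetical stand-in written for this illustration; it computes the same quantity as `cosine_similarity` for a single pair of vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
collinear = 3.0 * a                     # same direction, different length
orthogonal = np.array([0.0, 0.0, 5.0])  # shares no non-zero component with a
opposite = -a                           # exactly the opposite direction

print(cosine(a, collinear))   # close to 1
print(cosine(a, orthogonal))  # 0
print(cosine(a, opposite))    # close to -1
```

TF-IDF vectors built from counts have no negative entries, so for documents the similarity lies between 0 (no shared terms) and 1 (same direction).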


In [53]:
np.fill_diagonal(similiarities, 0)
max_ind = np.unravel_index(similiarities.argmax(), similiarities.shape)
similiarities[max_ind] # highest similarity of two documents

0.3182784779118088