<h1>Preprocessing</h1>

<p>
Preprocessing is the phase applied after Exploratory Data Analysis (EDA) and before model training. In this phase the data is modified so that clean, consistent data can be used to train the models.
</p>


<h2>Normalization (Also known as Feature Scaling)</h2>

<p>
It is a preprocessing step that many models need so that features on a comparable scale are given to them for training.
</br>
Two scaling methods are widely used:
<ol>
<li>Min-Max Scaling</li>
<li>Standard Scaling</li>
</ol>

It is used to normalize the data so that the features are in a similar range. This is important for algorithms that are sensitive to the shape of the feature distribution or that rely on vector- or distance-based computations.</br>

<p>
To demonstrate the scaling methods we will use the same DataFrame that was created in the earlier EDA notebook, <strong>02_Exploratory_data_analysis.ipynb</strong>.
</p>
</p>

<h2>Min-Max Scaling</h2>

<p>
It transforms each feature by rescaling it so that the minimum value of the feature maps to 0 and the maximum value maps to 1.</br>
The transformation is given by:</br>

![Image](Media/image.png)

The target range can be changed through the <code>feature_range=(min, max)</code> parameter of MinMaxScaler if required.
</p>
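<p>
The rescaling described above can be sketched in plain Python. This is only an illustration: it assumes the formula in the image is the usual <code>(x - min) / (max - min)</code>, and the helper name <code>min_max_scale</code> is hypothetical, not part of scikit-learn. Mapping a constant feature to the lower bound is a choice made for this sketch.
</p>

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so min(values) -> low and max(values) -> high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant feature has no spread; map every value to the lower bound.
        return [low for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


ages = [28, 23, 19]
scaled_01 = min_max_scale(ages)                              # default range (0, 1)
scaled_pm1 = min_max_scale(ages, feature_range=(-1.0, 1.0))  # custom range (-1, 1)
print(scaled_01)
print(scaled_pm1)
```

<p>
The largest age maps to the upper bound and the smallest to the lower bound in both cases; only the values in between depend on the chosen range.
</p>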


 ### fit() calculates parameters from the data, but does not change the data itself.
 ### transform() uses the parameters learned during fit() to actually modify the data.

In [None]:
import pandas as pd
df = pd.DataFrame({'Age': {0: 28, 1: 23, 2: 19},
'Gender_F': {0: 0.0, 1: 0.0, 2: 1.0},
'Gender_M': {0: 1.0, 1: 1.0, 2: 0.0},
'Degree_encoded': {0: 0.0, 1: 2.0, 2: 1.0}})
print(df)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
print(df)
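
<p>
The split between fit() and transform() matters most when new data arrives after fitting. The sketch below uses a hypothetical single-feature class, <code>TinyMinMaxScaler</code>, written only to mirror the pattern; it is not scikit-learn's implementation. The values in <code>new_ages</code> are made up for illustration.
</p>

```python
class TinyMinMaxScaler:
    """Hypothetical, single-feature stand-in for the fit/transform pattern."""

    def fit(self, values):
        # fit() only records parameters; `values` is left untouched.
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        # transform() applies the stored parameters and returns new values.
        span = (self.max_ - self.min_) or 1.0
        return [(v - self.min_) / span for v in values]


train_ages = [28, 23, 19]
new_ages = [19, 37]  # unseen data, scaled with the parameters learned from train_ages

tiny_scaler = TinyMinMaxScaler().fit(train_ages)
train_scaled = tiny_scaler.transform(train_ages)
new_scaled = tiny_scaler.transform(new_ages)

print(train_ages)    # the input list is unchanged by fit() and transform()
print(train_scaled)
print(new_scaled)
```

<p>
Because the minimum and maximum come from the fitted data only, a new value outside that range (37 here) lands outside [0, 1]. That is expected behaviour with this pattern rather than an error.
</p>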

<h2>Standard Scaling</h2>

<p>
It transforms each feature value by removing the mean and scaling to unit variance. Each value thus represents the z-score with respect to the mean and standard deviation of the column.</br>
The transformation is given by:</br>

![Image](Media/image2.png)

where
- μ is the mean
- s is the standard deviation of the samples

We can take the original values of the Age column and scale them using StandardScaler.
</p>

In [None]:
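from sklearn.preprocessing import StandardScaler
# The Min-Max cell above overwrote df['Age'], so restore the original values first
df['Age'] = [28, 23, 19]
scaler = StandardScaler()
arr = scaler.fit_transform(df[['Age']])  # returns a NumPy array; df itself is not modified
# df['Age'] = scaler.transform(df[['Age']])
print(df)
print(arr)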

In [None]:
scaler.mean_

In [None]:
scaler.scale_
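
<p>
The z-score transformation from this section can also be reproduced by hand with the standard library. This is a minimal sketch assuming the formula in the image is <code>z = (x - μ) / s</code>. It uses the population standard deviation (dividing by n), which, to my understanding, is what StandardScaler computes.
</p>

```python
from statistics import fmean, pstdev

ages = [28, 23, 19]

mu = fmean(ages)   # mean of the column
s = pstdev(ages)   # population standard deviation (divides by n, not n - 1)

# Subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / s for x in ages]

print(mu, s)
print(z_scores)
```

<p>
After the transformation the column has mean 0 and standard deviation 1, which is a quick sanity check for any standard-scaled feature.
</p>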

<h2>Preprocessing Text</h2>

<p>
To preprocess text we use NLP (Natural Language Processing), the field concerned with processing and generating text. One of the most popular Python libraries for working with text is NLTK (Natural Language Toolkit).
</p>

<h2>Five Steps of the NLTK Pipeline</h2>

<p>
There are five major steps in using the NLTK pipeline to preprocess textual data.
<ol>
<li>Segmentation</li>
<li>Tokenization</li>
<li>Stemming and Lemmatization</li>
<li>Removing Stopwords</li>
<li>Preparing Word Vectors</li>
</ol>
</p>

<h3>Segmentation</h3>

<p>
It is the process of finding the boundaries of sentences.</br>
A complex regular expression is often used to find these boundaries.
</p>
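
<p>
A deliberately simple version of regex-based segmentation is sketched below. The pattern and the helper name <code>split_sentences</code> are assumptions made for illustration. A production segmenter needs far more rules: this one would wrongly split after an abbreviation such as "Dr." when it is followed by a capitalized word.
</p>

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences in `text`, using the naive boundary rule above."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


text = "Let's learn machine learning. It is fun! Are you ready?"
sentences = split_sentences(text)
print(sentences)
```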

<h3>Tokenization</h3>

<p>
It breaks a sentence or sequence into individual components or units called tokens.
</br>
Basic NLTK code for <strong>Tokenization</strong> is shown below.
</p>

In [None]:
from nltk.tokenize import word_tokenize
word_tokenize("Let's learn machine learning!")

In [None]:
tokens = [t.lower() for t in word_tokenize("Let's learn the greatest machine learning!")]
print(tokens)
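
<p>
To make the idea of a token concrete without any NLTK data files, here is a rough regex-based sketch. The helper <code>simple_tokenize</code> is hypothetical and much cruder than <code>word_tokenize</code>; for example, it splits the apostrophe in "Let's" into its own token.
</p>

```python
import re


def simple_tokenize(text):
    # A run of word characters is one token; any other non-space character is its own token.
    return re.findall(r"\w+|[^\w\s]", text)


simple_tokens = [t.lower() for t in simple_tokenize("Let's learn machine learning!")]
print(simple_tokens)
```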

<h3><i>Stemming and Lemmatization</i></h3>

<p>
For grammatical reasons, the same word root can appear in different forms in the text. In most cases these forms have a similar meaning, for example, work, working, worked. Stemming reduces such forms to a common root. </br>
One popular method used for stemming is <strong>Porter's Stemmer</strong>. It performs rule-based operations such as the following:</br>

```
SSES -> SS
IES -> I
SS -> SS
S ->  
```

</p>

Its implementation is as follows:

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
for token in tokens:
    print(f'{token} : {stemmer.stem(token)}')

In [None]:
# A list comprehension collects all the stems at once, which is more convenient in larger programs
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)
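
<p>
The four suffix rules listed earlier (SSES, IES, SS, S) can be sketched in plain Python. This covers only those four rules, tried in the listed order; the full Porter algorithm has several more steps, and NLTK's PorterStemmer may differ in details. The function name <code>apply_plural_rules</code> is hypothetical.
</p>

```python
def apply_plural_rules(word):
    """Apply the first matching rule: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress stays caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for example in ["caresses", "ponies", "caress", "cats", "learn"]:
    print(f"{example} : {apply_plural_rules(example)}")
```

<p>
The order of the checks matters: testing for a bare "s" first would turn "caress" into "cares".
</p>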

<h3>Removing Stopwords</h3>

<p>
Some words add noise and slow down training, so they should be removed.
</br>These words are called <b>stopwords</b>.
</br>NLTK provides stopword lists that can be used to filter these words out of the text and make the process faster.</br>
Below is sample code for this step:
</p>

In [None]:
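from nltk.corpus import stopwords
eng_stp = set(stopwords.words('english'))  # a set makes membership checks fast
# Build a new list instead of calling remove() while iterating, which would skip elements
stemmed_tokens = [token for token in stemmed_tokens if token not in eng_stp]
print(stemmed_tokens)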
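
<p>
One pitfall when filtering tokens in Python is worth illustrating: removing items from a list while looping over it makes the loop skip the element that follows each removal. The sketch below uses a tiny, made-up stopword set (<code>stop_set</code>) so that it runs without NLTK data.
</p>

```python
stop_set = {"it", "is", "the", "are"}  # hypothetical mini stopword list

# Removing while iterating: after "it" is removed, "is" shifts into its slot and is never checked.
tokens_buggy = ["it", "is", "learning", "time"]
for token in tokens_buggy:
    if token in stop_set:
        tokens_buggy.remove(token)

# Building a new list checks every token exactly once.
tokens_clean = [t for t in ["it", "is", "learning", "time"] if t not in stop_set]

print(tokens_buggy)
print(tokens_clean)
```

<p>
The first list still contains "is" even though it is in the stopword set, while the list comprehension removes both stopwords.
</p>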

<h3>Word Vectors</h3>

<p>
Like OneHotEncoding and LabelEncoding, this step converts text to numbers. The difference is that OneHotEncoding and LabelEncoding convert categorical data to numbers, whereas here each text is converted to a vector based on how often each word occurs in it.
</p>

<p>
The class used for this in Python is <strong>CountVectorizer</strong>, found in <strong>sklearn.feature_extraction.text</strong>.
</p>
Sample code is shown below:

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()  # created for completeness; stemming is not applied in this example
eng_stp = set(stopwords.words('english'))  # load the stopword list once, as a set
data = ["Let's learn Machine Learning Now","The Machines are Learning","It is Learning Time"]
tokens = [word_tokenize(d) for d in data]  # one list of tokens per sentence in data
# print(tokens)
tokens = [[word.lower() for word in line] for line in tokens]

# Drop stopwords with a comprehension (calling remove() while iterating skips elements),
# then join each line back into a string, which is the input CountVectorizer expects.
tokens = [' '.join(word for word in line if word not in eng_stp) for line in tokens]

matrix = CountVectorizer()
X = matrix.fit_transform(tokens).toarray()  # rows = sentences, columns = vocabulary words

In [None]:
import pandas as pd
arr = pd.DataFrame(X, columns=matrix.get_feature_names_out())
print(arr)
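
<p>
To show what a count vector contains, the same idea can be sketched with only the standard library. The three short strings in <code>docs</code> are made-up, already-cleaned inputs, and the whitespace split is a simplification; CountVectorizer applies its own tokenization rules.
</p>

```python
from collections import Counter

docs = ["learn machine learning", "machines learning", "learning time"]

tokenized = [doc.split() for doc in docs]

# The vocabulary is every distinct word, sorted so that column order is stable.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word, each cell a word count.
counts = [Counter(doc) for doc in tokenized]
bow_matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

<p>
Note that "machine" and "machines" get separate columns here because no stemming was applied before counting.
</p>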

<h2>Preprocessing Images</h2>

<p>
Here we will see the preprocessing of images.
</p>

In [None]:
import matplotlib.pyplot as plt
img = plt.imread("D:\\Coding\\Python\\ML_practice\\Media\\dog.png")

In [None]:
plt.imshow(img)

In [None]:
plt.imshow(img[:,:,0])

In [None]:
cropped_img = img[200:470, 100:500, :]
print(img.shape)
plt.imshow(cropped_img)
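
<p>
The channel selection and cropping above are ordinary NumPy slicing on a (height, width, channels) array. The sketch below repeats them on a synthetic array so that the shapes can be checked without the image file. The 600x800 size and the name <code>fake_img</code> are assumptions made for illustration only.
</p>

```python
import numpy as np

# Synthetic 600x800 RGB image standing in for a real photo.
height, width = 600, 800
fake_img = np.zeros((height, width, 3), dtype=np.float32)
fake_img[:, :, 0] = np.linspace(0.0, 1.0, width)  # red channel ramps from left to right

red_channel = fake_img[:, :, 0]           # all rows, all columns, channel 0 only
cropped = fake_img[200:470, 100:500, :]   # rows 200-469, columns 100-499, every channel

print(fake_img.shape, red_channel.shape, cropped.shape)
```

<p>
The first slice index selects rows (the vertical axis) and the second selects columns, so a crop is written as <code>img[top:bottom, left:right, :]</code>.
</p>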

In [None]:
import matplotlib.pyplot as plt
from skimage import color, filters
# The Sobel filter emphasises the edges of the original image; the resulting edge map
# can act as one stage of a recognition or classification pipeline.
gray = color.rgb2gray(img[..., :3])  # drop any alpha channel and convert to grayscale first
edges = filters.sobel(gray)
plt.imshow(edges, cmap='gray')