# <font color = 'pickle'> **Customized Pre-processor** </font>
- In this notebook, we will use a custom pre-processor.
- We have used spaCy to create a class that we can use in this and future notebooks. The pre-processor is compatible with sklearn and can be used in sklearn pipelines.
- We will also learn how to import functions/classes from `.py` files.
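
To make "compatible with sklearn" concrete, here is a minimal sketch of what such a transformer could look like. It is not the `SpacyPreprocessor` class itself: the class name `WhitespaceLowercaser` and its single `lower` option are hypothetical, and the cleaning logic is deliberately trivial. The sketch assumes scikit-learn is installed.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceLowercaser(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: collapses whitespace, optionally lowercases."""

    def __init__(self, lower=True):
        # Store constructor arguments unchanged so get_params/set_params work.
        self.lower = lower

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        cleaned = []
        for doc in X:
            doc = " ".join(doc.split())  # collapse runs of whitespace
            cleaned.append(doc.lower() if self.lower else doc)
        return cleaned


docs = ["  The NEW iPhone X  seems cool ", "I like iOS 11 more"]

# fit_transform is supplied by TransformerMixin.
toy_cleaned = WhitespaceLowercaser().fit_transform(docs)

# Because the class follows the fit/transform convention, it can be a pipeline step.
pipeline = Pipeline([
    ("clean", WhitespaceLowercaser()),
    ("vectorize", CountVectorizer()),
])
features = pipeline.fit_transform(docs)
print(toy_cleaned)
print(features.shape)
```

Inheriting from `BaseEstimator` and `TransformerMixin` is one common way to get `get_params`, `set_params`, and `fit_transform` without writing them by hand.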

In [1]:
%load_ext autoreload
%autoreload 2

Code explanation:
- The code uses Jupyter magic commands to set up the environment before the rest of the notebook runs.

- The `%load_ext autoreload` line loads the autoreload extension, which can reload modules automatically before code in the notebook is executed.

- The `%autoreload 2` line sets the autoreload mode to 2, which means that all imported modules (except those excluded with `%aimport`) are reloaded before each execution.

<font color = 'indianred'> **This is useful when working with modules that change frequently during development, as it ensures that the latest version of the module is always used.** </font>
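
Outside of notebook magics, the same idea can be sketched with the standard library's `importlib.reload`. This is only an illustration of why reloading matters, not how the autoreload extension is implemented. The module name `demo_mod` and its `greet` function are hypothetical and are written to a temporary folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a throwaway module (hypothetical name: demo_mod) to a temporary folder.
tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "demo_mod.py"
module_path.write_text("def greet():\n    return 'version 1'\n")

# Make the folder importable and import the module for the first time.
sys.path.append(tmp_dir)
importlib.invalidate_caches()
demo_mod = importlib.import_module("demo_mod")
first = demo_mod.greet()

# Simulate editing the file during development.
module_path.write_text("def greet():\n    return 'version 2 (edited)'\n")

# Without a reload, the already-imported module keeps the old definition.
stale = demo_mod.greet()

# Reloading re-executes the source file, so the edit becomes visible.
demo_mod = importlib.reload(demo_mod)
second = demo_mod.greet()

print(first, "|", stale, "|", second)
```

With `%autoreload 2` active, the manual `importlib.reload` step is not needed in the notebook.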

In [2]:
import sys

- `sys.path` is the list of directories that the Python interpreter searches when importing modules.

In [3]:
from basic import basic_functions as bf
base_folder,data,archive,output = bf.set_folders()

Not Running on Colab
Base Folder is C:\Users\abdul\OneDrive\Documents\MSBA
Data Folder is C:\Users\abdul\OneDrive\Documents\MSBA\data_sets
Archive Folder is C:\Users\abdul\OneDrive\Documents\MSBA\archive
Output Folder is C:\Users\abdul\OneDrive\Documents\MSBA\output
The path to the custom functions is C:/Users/abdul/OneDrive/Documents/MSBA/custom_functions
The working directory is c:\Users\abdul\OneDrive\Documents\MSBA\notebooks\NLP


In [None]:
sys.path

['c:\\Users\\abdul\\OneDrive\\Documents\\MSBA\\notebooks\\NLP',
 'c:\\Users\\abdul\\anaconda3\\python310.zip',
 'c:\\Users\\abdul\\anaconda3\\DLLs',
 'c:\\Users\\abdul\\anaconda3\\lib',
 'c:\\Users\\abdul\\anaconda3',
 '',
 'c:\\Users\\abdul\\anaconda3\\lib\\site-packages',
 'c:\\users\\abdul\\onedrive\\documents\\msba\\custom_functions',
 'c:\\users\\abdul\\onedrive\\documents\\msba\\custom_functions\\basic',
 'c:\\Users\\abdul\\anaconda3\\lib\\site-packages\\win32',
 'c:\\Users\\abdul\\anaconda3\\lib\\site-packages\\win32\\lib',
 'c:\\Users\\abdul\\anaconda3\\lib\\site-packages\\Pythonwin',
 'C:/Users/abdul/OneDrive/Documents/MSBA/custom_functions']

- As you can see above, we have added `'C:/Users/abdul/OneDrive/Documents/MSBA/custom_functions'` to `sys.path`. We can now import modules from this folder.
- We can now create `.py` files in that folder and import functions/classes from them.
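
The mechanism can be sketched end to end with a temporary folder standing in for `custom_functions`. The file name `toy_helpers.py`, the function `shout`, and the class `Greeter` are hypothetical names used only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A temporary folder plays the role of the custom_functions folder.
custom_folder = Path(tempfile.mkdtemp())

# Create a small .py file containing one function and one class.
(custom_folder / "toy_helpers.py").write_text(
    "def shout(text):\n"
    "    return text.upper()\n"
    "\n"
    "class Greeter:\n"
    "    def __init__(self, name):\n"
    "        self.name = name\n"
    "\n"
    "    def greet(self):\n"
    "        return 'Hello, ' + self.name\n"
)

# Add the folder to sys.path only if it is not already listed.
if str(custom_folder) not in sys.path:
    sys.path.append(str(custom_folder))

# The module name is the file name without the .py extension.
importlib.invalidate_caches()
import toy_helpers as th

shouted = th.shout("spacy")
greeting = th.Greeter("notebook").greet()
print(shouted, "|", greeting)
```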

In [4]:
# we can now use the functions and classes from the file custom_preprocessor_mod.py, located in the folder added to sys.path
import custom_preprocessor_mod as cp

In [5]:
!python -m spacy download en_core_web_sm -qq

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
text = ["""New version of operation system is iOS 11. It is better than iOS 9.
The new version of iPhone X seems cool.""", """ <p> The video of iphone x released. I liked iOS 9 but I like iOS 11 more.
You may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  https://jindal.utdallas.edu/""", 
"""</p><p>The concept of regular expressions began in the 1950s, when the American mathematician <a href="/wiki/Stephen_Cole_Kleene" title="Stephen Cole Kleene">Stephen Cole Kleene</a> formalized the description of a <i><a href="/wiki/Regular_language" title="Regular language">regular language</a></i>. They came into common use with <a href="/wiki/Unix" title="Unix">Unix</a> text-processing utilities. Different <a href="/wiki/Syntax_(programming_languages)" title="Syntax (programming languages)">syntaxes</a> for writing regular expressions have existed since the 1980s, one being the <a href="/wiki/POSIX" title="POSIX">POSIX</a> standard and another, widely used, being the <a href="/wiki/Perl" title="Perl">Perl</a> syntax.
</p><p>Regular expressions are used in <a href="/wiki/Search_engine" title="Search engine">search engines</a>, search and replace dialogs of <a href="/wiki/Word_processor" title="Word processor">word processors</a> and <a href="/wiki/Text_editor" title="Text editor">text editors</a>, in <a href="/wiki/Text_processing" title="Text processing">text processing</a> utilities such as <a href="/wiki/Sed" title="Sed">sed</a> and <a href="/wiki/AWK" title="AWK">AWK</a> and in <a href="/wiki/Lexical_analysis" title="Lexical analysis">lexical analysis</a>. Many <a href="/wiki/Programming_language" title="Programming language">programming languages</a> provide regex capabilities either built-in or via <a href="/wiki/Library_(computing)" title="Library (computing)">libraries</a>, as it has uses in many situations.
</p> """]

In [7]:
import textwrap as tw

The `textwrap` module in Python provides a simple and convenient way to format and wrap text. Two of its functions for wrapping text are `wrap()` and `fill()`.

- `textwrap.wrap()` takes a string of text and returns a list of strings, each of which is a single line of the wrapped text.

- `textwrap.fill()` takes a string of text and returns a single string that contains the wrapped text with line breaks inserted.

The two functions differ only in the format of the result: `fill()` returns the same lines that `wrap()` produces, joined with newline characters into one string.

- You might use `wrap()` if you need to process the wrapped text line by line, for example, to display the lines in a GUI or to write the lines to a file. 
- You might use `fill()` if you need to return the wrapped text as a single string, for example, to display the text in a console or to insert it into an HTML document.
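
A short sketch of the difference, using a sentence from the sample text and an arbitrary width of 30:

```python
import textwrap as tw

sample = ("The video of iphone x released. I liked iOS 9 "
          "but I like iOS 11 more.")

lines = tw.wrap(sample, width=30)  # list of strings, one per wrapped line
block = tw.fill(sample, width=30)  # one string with newlines inserted

# wrap() is convenient when each line needs separate handling.
for line_number, line in enumerate(lines, start=1):
    print(line_number, line)

# fill() is convenient when the whole text is printed at once.
print(block)

# fill() gives the same lines as wrap(), joined by newlines.
print(block == "\n".join(lines))
```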

In [8]:
for item in text:
  print()
  print(tw.fill(item, width=100))


New version of operation system is iOS 11. It is better than iOS 9. The new version of iPhone X
seems cool.

 <p> The video of iphone x released. I liked iOS 9 but I like iOS 11 more. You may not like my like
@Tech_Guru #Iphone #IOS harpreet@utdallas.edu  https://jindal.utdallas.edu/

</p><p>The concept of regular expressions began in the 1950s, when the American mathematician <a
href="/wiki/Stephen_Cole_Kleene" title="Stephen Cole Kleene">Stephen Cole Kleene</a> formalized the
description of a <i><a href="/wiki/Regular_language" title="Regular language">regular
language</a></i>. They came into common use with <a href="/wiki/Unix" title="Unix">Unix</a> text-
processing utilities. Different <a href="/wiki/Syntax_(programming_languages)" title="Syntax
(programming languages)">syntaxes</a> for writing regular expressions have existed since the 1980s,
one being the <a href="/wiki/POSIX" title="POSIX">POSIX</a> standard and another, widely used, being
the <a href="/wiki/Perl" title="Perl">

In [9]:
# let us look at the custom preprocessor from the custom module we imported earlier
cp.SpacyPreprocessor?

Init signature:
cp.SpacyPreprocessor(
    model,
    *,
    batch_size=64,
    lemmatize=True,
    lower=True,
    remove_stop=True,
    remove_punct=True,
    remove_email=True,
    remove_url=True,
    remove_num=False,
    stemming=False,
    add_user_mention_prefix=True,
    remove_hashtag_prefix=False,


In [10]:
# start with only lemmatize=True
# add_user_mention_prefix=True
# remove_hashtag_prefix=False
# make both of the above True and see what happens
# remove punctuation
# lowercase (note: lemmatization converts everything to lower case except proper nouns)
# make stemming=True (nothing should change)
# to do stemming we have to set lemmatize=False, stemming=True
# remove stop words
# remove emails
# remove urls
# remove num (spaCy did not remove 1950, 1980)
# change both lemmatization and stemming to False
# we see that the text was 1950s and 1980s - lemmas were 1950 and 1980 - but spaCy does not treat these as numbers
# we will have to write a regular expression if we want to remove these
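
One possible regular expression for that last step is sketched below. It is an assumption about what "number-like" should mean here (a standalone run of digits with an optional trailing `s`), and the helper name `remove_number_like_tokens` is hypothetical. It is not part of the custom pre-processor.

```python
import re

# Standalone digit runs with an optional trailing "s", e.g. "11", "1950", "1980s".
# The word boundaries keep tokens such as "iOS9" or "x86" untouched.
NUMBER_LIKE = re.compile(r"\b\d+s?\b")


def remove_number_like_tokens(text: str) -> str:
    """Drop number-like tokens and collapse the leftover whitespace."""
    without_numbers = NUMBER_LIKE.sub(" ", text)
    return " ".join(without_numbers.split())


sample = "regular expressions began in the 1950s and existed since 1980 like iOS 11"
no_numbers = remove_number_like_tokens(sample)
print(no_numbers)
```

Applied after the pre-processor, a step like this would also remove plain numbers such as `11` and `9`, so it should only be used when that is the intended behaviour.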

In [13]:
# create the spaCy pre-processor from the custom module, with every option switched off
preprocessor = cp.SpacyPreprocessor(model = 'en_core_web_sm', batch_size=64, lemmatize=False, lower=False, remove_stop=False, 
                remove_punct=False, remove_email=False, remove_url=False, remove_num=False, stemming = False,
                add_user_mention_prefix=False, remove_hashtag_prefix=False)

In [14]:
cleaned_text = preprocessor.fit_transform(text)

for item in cleaned_text:
  print()
  print(tw.fill(item, width=100))


New version of operation system is iOS 11 . It is better than iOS 9 . The new version of iPhone X
seems cool .

   The video of iphone x released . I liked iOS 9 but I like iOS 11 more . You may not like my like
@Tech_Guru # Iphone # IOS harpreet@utdallas.edu   https://jindal.utdallas.edu/

The concept of regular expressions began in the 1950s , when the American mathematician Stephen Cole
Kleene formalized the description of a regular language . They came into common use with Unix text -
processing utilities . Different syntaxes for writing regular expressions have existed since the
1980s , one being the POSIX standard and another , widely used , being the Perl syntax . Regular
expressions are used in search engines , search and replace dialogs of word processors and text
editors , in text processing utilities such as sed and AWK and in lexical analysis . Many
programming languages provide regex capabilities either built - in or via libraries , as it has uses
in many situations .




In [17]:
preprocessor = cp.SpacyPreprocessor(model = 'en_core_web_sm', batch_size=64, lemmatize=False, lower=True, remove_stop=True, 
                remove_punct=True, remove_email=True, remove_url=True, remove_num=False, stemming = True,
                add_user_mention_prefix=False, remove_hashtag_prefix=True)
cleaned_text = preprocessor.fit_transform(text)

for item in cleaned_text:
  print()
  print(tw.fill(item, width=100))


new version oper system io 11 better io 9 new version iphon x cool

   video iphon x releas like io 9 like io 11 like like @tech_guru #iphon #io

concept regular express began 1950 american mathematician stephen cole kleen formal descript regular
languag came common use unix text process util differ syntax write regular express exist 1980 posix
standard wide perl syntax regular express search engin search replac dialog word processor text
editor text process util sed awk lexic analysi program languag provid regex capabl built librari use
situat


In [18]:
import pandas as pd
my_data = pd.DataFrame(text, columns=['text'])
my_data

   text
0  New version of operation system is iOS 11. It ...
1  <p> The video of iphone x released. I liked i...
2  </p><p>The concept of regular expressions bega...


In [19]:
my_data['cleaned_text'] = cleaned_text
my_data

   text                                               cleaned_text
0  New version of operation system is iOS 11. It ...  new version oper system io 11 better io 9 new ...
1  <p> The video of iphone x released. I liked i...   video iphon x releas like io 9 like io 11 l...
2  </p><p>The concept of regular expressions bega...  concept regular express began 1950 american ma...
