# <font color = 'pickle'> **Customized Pre-processor**
- In this notebook, we will use a custom pre-processor.
- The pre-processor is a class built on spaCy that we can reuse in this and future notebooks. It is compatible with sklearn and can be used in sklearn pipelines.
- We will also learn how to import functions and classes from `.py` files.
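
To make "compatible with sklearn" concrete: scikit-learn pipelines call `fit` and `transform` (or `fit_transform`) on each step. The sketch below shows that interface only. `TinyPreprocessor` and its two options are hypothetical names invented for illustration, it uses plain regular expressions rather than spaCy, and it is not the implementation of the custom pre-processor used in this notebook.

```python
import re


class TinyPreprocessor:
    """Hypothetical stand-in that mimics the fit/transform interface."""

    def __init__(self, lower=False, remove_url=False):
        self.lower = lower
        self.remove_url = remove_url

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit simply returns self.
        return self

    def transform(self, X):
        cleaned = []
        for doc in X:
            # Replace anything that looks like an HTML tag with a space.
            doc = re.sub(r"<[^>]+>", " ", doc)
            if self.remove_url:
                doc = re.sub(r"https?://\S+", " ", doc)
            if self.lower:
                doc = doc.lower()
            # Collapse runs of whitespace left behind by the substitutions.
            cleaned.append(" ".join(doc.split()))
        return cleaned

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


docs = ["<p>I like iOS 11 more.</p>", "Visit https://jindal.utdallas.edu/ today"]
plain = TinyPreprocessor().fit_transform(docs)
no_urls = TinyPreprocessor(lower=True, remove_url=True).fit_transform(docs)
print(plain)
print(no_urls)
```

A class meant for real pipelines would typically also inherit from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin`, so that parameter inspection and cloning work as sklearn expects.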

In [1]:
%load_ext autoreload
%autoreload 2


Code explanation:
- The code uses Jupyter magic commands to set up the notebook environment.

- The `%load_ext autoreload` line loads the autoreload extension, which can reload modules automatically before code is executed in a Jupyter notebook.

- The `%autoreload 2` line sets autoreload mode 2, which reloads all imported modules (except those excluded with `%aimport`) every time before the code you run is executed.

<font color = 'indianred'> **This is useful when working with modules that change frequently during development, as it ensures that the latest version of the module is always used.**
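
The effect that autoreload automates can be reproduced by hand with `importlib.reload`. The sketch below is illustrative only: the module name `demo_autoreload_mod` and the temporary directory are assumptions made for the example and are not part of this notebook's setup.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Avoid cached bytecode so the edited source is always re-read in this demo.
sys.dont_write_bytecode = True

# Create a throwaway module in a temporary directory (hypothetical name).
tmp_dir = tempfile.mkdtemp()
module_file = Path(tmp_dir) / "demo_autoreload_mod.py"
module_file.write_text("def greet():\n    return 'v1'\n")

sys.path.append(tmp_dir)
importlib.invalidate_caches()
import demo_autoreload_mod

first = demo_autoreload_mod.greet()

# Simulate editing the file during development.
module_file.write_text("def greet():\n    return 'version 2'\n")

# Without a reload, the already-imported module still runs the old code.
stale = demo_autoreload_mod.greet()

# A manual reload picks up the change; %autoreload 2 does this step for us.
importlib.reload(demo_autoreload_mod)
fresh = demo_autoreload_mod.greet()

print(first, stale, fresh)
```

With `%autoreload 2` active, the explicit `importlib.reload` call is unnecessary because the extension checks for changed modules before each cell runs.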

In [2]:
import sys

In [3]:
sys.path


['/content',
 '/env/python',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '',
 '/usr/local/lib/python3.10/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.10/dist-packages/IPython/extensions',
 '/root/.ipython']

- `sys.path` is a list of directories that the Python interpreter searches when importing modules.
- The `sys.path.append()` method adds a directory to this list, allowing Python to find and import modules located in that directory.
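
As a small, self-contained illustration of the two points above, the sketch below writes a module into a temporary directory, shows that the import fails while the directory is not on `sys.path`, and succeeds once it has been appended. The module name `demo_helpers_mod` and its `double` function are hypothetical and exist only for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny module into a directory Python does not search yet.
tmp_dir = tempfile.mkdtemp()
(Path(tmp_dir) / "demo_helpers_mod.py").write_text(
    "def double(x):\n    return 2 * x\n"
)

# The import fails because tmp_dir is not on sys.path.
try:
    import demo_helpers_mod
    found_before = True
except ModuleNotFoundError:
    found_before = False

# After appending the directory, the same import succeeds.
sys.path.append(tmp_dir)
importlib.invalidate_caches()
import demo_helpers_mod

result = demo_helpers_mod.double(21)
print(found_before, result)
```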

In [4]:
# Check if the code is being executed in Google Colab
if 'google.colab' in str(get_ipython()):

    # Install or upgrade (-U) the spacy library, with the -qq flag for quiet output
    !pip install -U spacy -qq

    # Import the drive module from google.colab
    # Mount Google Drive to access files and directories
    from google.colab import drive
    drive.mount('/content/drive')

    # Add the path to the custom-functions directory in Google Drive to sys.path
    sys.path.append('/content/drive/MyDrive/data/custom-functions')

    # Set the basepath to the data directory in Google Drive
    basepath = '/content/drive/MyDrive/data'

else:
    # Add the path to the custom-functions directory in the local file system to sys.path
    sys.path.append(
        '/home/harpreet/Insync/google_drive_shaannoor/data/custom-functions')

    # Set the basepath to the data directory in the local file system
    basepath = '/home/harpreet/Insync/google_drive_shaannoor/data'


Mounted at /content/drive


In [5]:
sys.path


['/content',
 '/env/python',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '',
 '/usr/local/lib/python3.10/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.10/dist-packages/IPython/extensions',
 '/root/.ipython',
 '/content/drive/MyDrive/data/custom-functions']

- As you can see above, we have added `'/content/drive/MyDrive/data/custom-functions'` to `sys.path`. We can now import modules from this folder.
- Any `.py` file we place in this folder can now be imported, along with the functions and classes it defines.

In [6]:
# we can now use the functions and classes from the file custom_preprocessor_mod.py, located in the folder added to sys.path
import custom_preprocessor_mod as cp


In [7]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 25.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [8]:
text = ["""New version of operation system is iOS 11. It is better than iOS 9.
The new version of iPhone X seems cool.""", """ <p> The video of iphone x released. I liked iOS 9 but I like iOS 11 more.
You may not like my like @Tech_Guru #Iphone #IOS harpreet@utdallas.edu  https://jindal.utdallas.edu/""",
        """</p><p>The concept of regular expressions began in the 1950s, when the American mathematician <a href="/wiki/Stephen_Cole_Kleene" title="Stephen Cole Kleene">Stephen Cole Kleene</a> formalized the description of a <i><a href="/wiki/Regular_language" title="Regular language">regular language</a></i>. They came into common use with <a href="/wiki/Unix" title="Unix">Unix</a> text-processing utilities. Different <a href="/wiki/Syntax_(programming_languages)" title="Syntax (programming languages)">syntaxes</a> for writing regular expressions have existed since the 1980s, one being the <a href="/wiki/POSIX" title="POSIX">POSIX</a> standard and another, widely used, being the <a href="/wiki/Perl" title="Perl">Perl</a> syntax.
</p><p>Regular expressions are used in <a href="/wiki/Search_engine" title="Search engine">search engines</a>, search and replace dialogs of <a href="/wiki/Word_processor" title="Word processor">word processors</a> and <a href="/wiki/Text_editor" title="Text editor">text editors</a>, in <a href="/wiki/Text_processing" title="Text processing">text processing</a> utilities such as <a href="/wiki/Sed" title="Sed">sed</a> and <a href="/wiki/AWK" title="AWK">AWK</a> and in <a href="/wiki/Lexical_analysis" title="Lexical analysis">lexical analysis</a>. Many <a href="/wiki/Programming_language" title="Programming language">programming languages</a> provide regex capabilities either built-in or via <a href="/wiki/Library_(computing)" title="Library (computing)">libraries</a>, as it has uses in many situations.
</p> """]


In [9]:
import textwrap as tw


The `textwrap` module in Python provides a simple and convenient way to format and wrap text. Two of its functions wrap text to a given width: `wrap()` and `fill()`.

- `textwrap.wrap()` takes a string of text and returns a list of strings, each of which is a single line of the wrapped text.

- `textwrap.fill()` takes a string of text and returns a single string that contains the wrapped text with line breaks inserted.

The two functions differ only in the format of the result: `fill()` is equivalent to joining the lines returned by `wrap()` with newline characters.

- You might use `wrap()` if you need to process the wrapped text line by line, for example, to display the lines in a GUI or to write the lines to a file.
- You might use `fill()` if you need to return the wrapped text as a single string, for example, to display the text in a console or to insert it into an HTML document.
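
The difference is easy to see on a short string. The sample sentence and the width of 20 below are arbitrary choices for illustration.

```python
import textwrap as tw

sample = "The quick brown fox jumps over the lazy dog."

# wrap() returns a list with one string per wrapped line.
lines = tw.wrap(sample, width=20)
print(lines)
# ['The quick brown fox', 'jumps over the lazy', 'dog.']

# fill() returns one string with newlines between the wrapped lines.
block = tw.fill(sample, width=20)
print(block)
# The quick brown fox
# jumps over the lazy
# dog.
```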

In [10]:
for item in text:
    print()
    print(tw.fill(item, width=100))



New version of operation system is iOS 11. It is better than iOS 9. The new version of iPhone X
seems cool.

 <p> The video of iphone x released. I liked iOS 9 but I like iOS 11 more. You may not like my like
@Tech_Guru #Iphone #IOS harpreet@utdallas.edu  https://jindal.utdallas.edu/

</p><p>The concept of regular expressions began in the 1950s, when the American mathematician <a
href="/wiki/Stephen_Cole_Kleene" title="Stephen Cole Kleene">Stephen Cole Kleene</a> formalized the
description of a <i><a href="/wiki/Regular_language" title="Regular language">regular
language</a></i>. They came into common use with <a href="/wiki/Unix" title="Unix">Unix</a> text-
processing utilities. Different <a href="/wiki/Syntax_(programming_languages)" title="Syntax
(programming languages)">syntaxes</a> for writing regular expressions have existed since the 1980s,
one being the <a href="/wiki/POSIX" title="POSIX">POSIX</a> standard and another, widely used, being
the <a href="/wiki/Perl" title="Perl">

In [11]:
# let us look at the custom preprocessor from the custom module we imported earlier
cp.SpacyPreprocessor?


In [12]:
# create an instance of the spaCy pre-processor from the custom module, with every optional cleaning step switched off
preprocessor = cp.SpacyPreprocessor(model='en_core_web_sm', batch_size=64, lemmatize=False, lower=False,
                                    remove_stop=False, remove_punct=False, remove_email=False,
                                    remove_url=False, remove_num=False, stemming=False,
                                    add_user_mention_prefix=False, remove_hashtag_prefix=False)


In [13]:
cleaned_text = preprocessor.fit_transform(text)

for item in cleaned_text:
    print()
    print(tw.fill(item, width=100))



New version of operation system is iOS 11 . It is better than iOS 9 . The new version of iPhone X
seems cool .

   The video of iphone x released . I liked iOS 9 but I like iOS 11 more . You may not like my like
@Tech_Guru # Iphone # IOS harpreet@utdallas.edu   https://jindal.utdallas.edu/

The concept of regular expressions began in the 1950s , when the American mathematician Stephen Cole
Kleene formalized the description of a regular language . They came into common use with Unix text -
processing utilities . Different syntaxes for writing regular expressions have existed since the
1980s , one being the POSIX standard and another , widely used , being the Perl syntax . Regular
expressions are used in search engines , search and replace dialogs of word processors and text
editors , in text processing utilities such as sed and AWK and in lexical analysis . Many
programming languages provide regex capabilities either built - in or via libraries , as it has uses
in many situations .



In [15]:
import pandas as pd
my_data = pd.DataFrame(text, columns=['text'])
my_data


| | text |
|---|---|
| 0 | `New version of operation system is iOS 11. It ...` |
| 1 | `<p> The video of iphone x released. I liked i...` |
| 2 | `</p><p>The concept of regular expressions bega...` |


In [16]:
my_data['cleaned_text'] = cleaned_text
my_data


| | text | cleaned_text |
|---|---|---|
| 0 | `New version of operation system is iOS 11. It ...` | `New version of operation system is iOS 11 . It...` |
| 1 | `<p> The video of iphone x released. I liked i...` | `The video of iphone x released . I liked iO...` |
| 2 | `</p><p>The concept of regular expressions bega...` | `The concept of regular expressions began in th...` |
