# Using Jupyter Notebooks



In [None]:
!pip install nltk spacy gensim pandas scikit-learn

#NLTK (Natural Language Toolkit):

#A comprehensive library for working with human language data (text) in Python. NLTK provides tools for text processing, including tokenization, parsing, stemming, tagging, and semantic reasoning, and is often used in natural language processing (NLP) tasks. It includes a wide array of corpora and lexical resources, such as WordNet, which are helpful in linguistics and language analysis.
#spaCy:

#An advanced NLP library designed for fast and efficient processing of large volumes of text. spaCy offers high-performance solutions for tokenization, part-of-speech tagging, named entity recognition (NER), dependency parsing, and word vector generation. It’s widely used for tasks requiring speed and accuracy in text processing, making it popular for building NLP models in production environments.
#Gensim:

#A Python library for topic modeling, document similarity analysis, and text mining. Gensim specializes in unsupervised and large-scale machine learning, especially in handling text data. It provides implementations for popular algorithms like Word2Vec, Doc2Vec, and Latent Dirichlet Allocation (LDA), allowing users to efficiently find patterns and relationships within text data.
#Pandas:

#A powerful and flexible data manipulation library for Python, essential for data analysis and data wrangling tasks. Pandas provides two primary data structures: DataFrame and Series, which allow for easy data manipulation, filtering, aggregation, and analysis. It is particularly useful for working with structured data, enabling data scientists and analysts to handle datasets in a tabular format similar to SQL or Excel.
#scikit-learn:

#A machine learning library built on top of NumPy, SciPy, and matplotlib, which provides simple and efficient tools for predictive data analysis. Scikit-learn offers a range of supervised and unsupervised learning algorithms, including classification, regression, clustering, and dimensionality reduction. It is designed for rapid prototyping and is widely used for implementing machine learning models in research and production.





## Introduction
Jupyter Notebook is an open-source, web-based editor which runs in your browser. A notebook combines text written in an html-like format known as "markdown", such as the block of text you're reading right now, with inline code, tools and outputs such as this one:

In [None]:
import nltk
nltk.download('punkt') # Download the Punkt tokenizer
from nltk.tokenize import word_tokenize

sentence = "Hello, world! This is NLP."
tokens = word_tokenize(sentence)
print(tokens) # Output: ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']

#nltk:

#Short for Natural Language Toolkit, NLTK is a Python library that provides tools for text processing, including tokenization, parsing, stemming, and various NLP tasks. It also includes access to large lexical resources like WordNet.
#download:

#A method used here to download NLTK's "punkt" tokenizer. The download function in NLTK allows you to fetch various text-processing resources like tokenizers, corpora, and models directly from the NLTK server.
#punkt:

#A pre-trained tokenizer model in NLTK designed for sentence and word tokenization. It helps break down sentences into individual tokens, handling punctuation and other language-specific tokenization rules.
#tokenize:

#A process in NLP to split text into smaller components called tokens, such as words or sentences. Tokenization is essential for preparing raw text for further analysis or model training.
#word_tokenize:

#A specific tokenizer in NLTK used to break down a sentence into individual words and punctuation marks. It first uses the Punkt model to split the text into sentences, then splits each sentence into word and punctuation tokens.
#sentence:

#In this code, sentence is a variable holding a string of text to be tokenized. In NLP, a sentence typically refers to a sequence of words that form a complete thought and is used as a unit of analysis.
#tokens:

#A list that stores the individual words and punctuation marks obtained after tokenizing the sentence. In NLP, tokens are the basic units of analysis, representing each word, symbol, or punctuation mark separately.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
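
To make the idea of tokenization concrete without downloading anything, here is a minimal sketch that uses only Python's built-in `re` module. It is a simplification and not NLTK's algorithm: the helper name `simple_tokenize` and the variable `rough_tokens` are made up for this illustration, and the pattern only separates runs of word characters from single punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Hypothetical helper for illustration only. NLTK's word_tokenize
    handles contractions, abbreviations and many other cases that this
    regular expression does not.
    """
    # \w+     matches a run of letters, digits or underscores.
    # [^\w\s] matches one character that is neither a word character
    #         nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

rough_tokens = simple_tokenize("Hello, world! This is NLP.")
print(rough_tokens)
# ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
```

For this simple sentence the result matches the NLTK output above, but the two approaches will diverge on harder input such as "don't" or "U.S.A.".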


In [None]:
import nltk
nltk.download('punkt') # Download the Punkt tokenizer
from nltk.tokenize import word_tokenize
# nltk:
# Short for Natural Language Toolkit, nltk is a powerful Python library for natural language processing (NLP).
# It provides tools for processing text data, such as tokenization, stemming, and tagging,
# and includes access to resources like WordNet for linguistic analysis.
#
# download:
# The download function in nltk allows users to fetch resources, such as tokenizers, corpora,
# and models, directly from NLTK’s server. Here, it’s downloading the "punkt" tokenizer.
#
# punkt:
# "Punkt" is a pre-trained tokenizer model within NLTK that’s used for sentence and word tokenization.
# It was specifically trained to handle common punctuation and tokenization rules, making it useful
# for accurately splitting sentences and words in various languages.
#
# tokenize:
# Tokenization is the process of dividing text into smaller units, such as words or sentences,
# known as tokens. This is a crucial step in NLP as it allows text to be processed and analyzed more effectively.
#
# word_tokenize:
# A specific function in NLTK’s tokenize module that splits a given sentence or text into individual
# words and punctuation marks. word_tokenize first uses the Punkt model to split the text into sentences,
# then splits each sentence into word and punctuation tokens.


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

# stopwords:
# A predefined list of commonly used words (like "and", "the", "is") in a language
# that are often removed from text during processing. Stopwords typically carry
# little meaning and do not contribute significantly to the context of the data.
#
# nltk:
# Short for Natural Language Toolkit, nltk is a Python library that provides tools
# and resources for processing human language data (text). It includes modules for
# text classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
#
# download:
# A method in NLTK used to retrieve various resources such as corpora and models
# from the NLTK server. Here, it downloads the dataset of stop words.
#
# set():
# A built-in Python data structure that creates an unordered collection of unique
# elements. It is useful for membership testing and eliminating duplicate entries from a list.
#
# words():
# A function from the stopwords module in NLTK that returns a list of stop words
# for a specified language. In this case, it retrieves the list for English.
#
# filtered_tokens:
# A list that stores tokens after removing stop words. This variable holds the final
# result, consisting of words that contribute more meaning to the text.
#
# tokens:
# A list of words or tokens derived from an initial text, which can include words
# and punctuation. This variable is expected to be defined earlier in the code (from the tokenization step).



['Hello', ',', 'world', '!', 'NLP', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
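
The filtering step can be sketched without NLTK. The stop-word set below is a tiny hand-written assumption (NLTK's English list is much longer), and the names `toy_stop_words`, `sample_tokens`, `kept` and `words_only` are chosen only for this example. It also shows one possible follow-up step, dropping punctuation, since punctuation survives stop-word removal in the output above.

```python
# A tiny, hand-written stop-word set (an assumption for illustration).
toy_stop_words = {"this", "is", "a", "the", "and"}

sample_tokens = ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']

# Lower-case each token before the membership test so "This" matches "this".
kept = [tok for tok in sample_tokens if tok.lower() not in toy_stop_words]
print(kept)
# ['Hello', ',', 'world', '!', 'NLP', '.']

# Optionally keep only alphabetic tokens to remove punctuation as well.
words_only = [tok for tok in kept if tok.isalpha()]
print(words_only)
# ['Hello', 'world', 'NLP']
```

Using a `set` rather than a list keeps each membership test fast, which matters once the token list grows.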


In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(ps.stem("faster")) # Output: faster
print(lemmatizer.lemmatize("faster")) # Output: faster (the default part of speech is noun; more context is needed for lemmatization)

# PorterStemmer:
# A class in NLTK used for stemming, which reduces words to a root or base form
# by stripping common suffixes. The Porter algorithm is one of the most commonly
# used stemming algorithms.
#
# WordNetLemmatizer:
# A class in NLTK that performs lemmatization, which is the process of converting
# a word to its base form (lemma). Unlike stemming, lemmatization looks the word up
# in a dictionary (WordNet) for a given part of speech, so it returns a valid word.
#
# download:
# A method in NLTK used to retrieve resources such as corpora and models from
# the NLTK server. In this case, it downloads the WordNet database, which is
# essential for lemmatization.
#
# stem():
# A method in the PorterStemmer class that takes a word as input and returns
# its stemmed version. The process may not always yield a meaningful root form,
# especially for irregular words.
#
# lemmatize():
# A method in the WordNetLemmatizer class that converts a word into its lemma
# based on its intended meaning. Additional context (like part of speech) can
# enhance its accuracy.



faster
faster


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
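
The difference between the two approaches can be illustrated with a toy sketch. Neither function below is NLTK's implementation: `toy_stem` is a made-up suffix stripper (far cruder than the Porter algorithm) and `toy_lemmatize` is a made-up dictionary lookup with a two-entry table, both assumed purely for illustration.

```python
def toy_stem(word):
    """Strip one common English suffix. A toy, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary keyed on (word, part of speech).
TOY_LEMMAS = {("faster", "a"): "fast", ("running", "v"): "run"}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; default is noun."""
    return TOY_LEMMAS.get((word, pos), word)

for w in ["jumping", "jumped", "jumps", "faster", "studies"]:
    print(w, "->", toy_stem(w))
# jumping -> jump
# jumped -> jump
# jumps -> jump
# faster -> faster
# studies -> studi

print(toy_lemmatize("faster"))           # faster (looked up as a noun)
print(toy_lemmatize("faster", pos="a"))  # fast   (looked up as an adjective)
```

The stem "studi" is not a real word, which is typical of rule-based stemming, while the dictionary lookup only helps when it is told the right part of speech.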


In [None]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# pandas:
# A powerful Python library used for data manipulation and analysis, particularly
# useful for working with structured data like tables (DataFrames).
#
# nltk:
# Short for Natural Language Toolkit, nltk is a Python library providing tools
# for natural language processing (NLP), including text processing, tokenization,
# and linguistic analysis.
#
# train_test_split:
# A function in the sklearn.model_selection module that divides a dataset into
# training and testing subsets, helping in model evaluation by ensuring that
# the model is tested on unseen data.
#
# CountVectorizer:
# A class in the sklearn.feature_extraction.text module that converts a
# collection of text documents into a matrix of token counts, facilitating the
# transformation of text data into a numerical format for machine learning.
#
# MultinomialNB:
# A class from the sklearn.naive_bayes module that implements the Naive Bayes
# algorithm for classification tasks, particularly effective for discrete
# features like word counts in text classification.
#
# metrics:
# A module in sklearn that provides functions to evaluate the performance of
# machine learning models, including accuracy, precision, recall, and F1-score.


In [None]:
data = {
    'text': [
        'I love this movie!',
        'This was a terrible movie.',
        'I really enjoyed the film.',
        'Worst experience ever.',
        'It was fantastic!',
        'Not worth the time.',
        'Absolutely amazing!',
        'It was okay, not great.',
        'I hate this film.',
        'Best movie ever!'
    ],
    'sentiment': [
        'negative',
        'positive',
        'negative',
        'positive',
        'negative',
        'positive',
        'negative',
        'neutral',
        'negative',
        'positive'
    ]
}

# data:
# A variable name used to store structured information in Python, often in the form of dictionaries, lists, or other data types.
#
# dictionary:
# A built-in data type in Python that stores data in key-value pairs, allowing for efficient data retrieval.
#
# text:
# A key in the dictionary representing a collection of textual data, such as user reviews or comments.
#
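# sentiment:
# A key in the dictionary that categorizes the emotional tone expressed in the corresponding text samples (e.g., positive, negative, neutral).
# Note: in this sample, most labels are the opposite of what the text expresses (for example, 'I love this movie!' is labeled 'negative'), so a model trained on it learns those inverted associations.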
#
# list:
# A built-in data structure in Python that holds an ordered collection of items, which can be of different types.


In [None]:
df = pd.DataFrame(data)

# df:
# A variable name used to store a DataFrame object, which is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure in pandas.
#
# pd:
# An alias commonly used for the pandas library in Python, allowing for easier access to its functions and classes.
#
# DataFrame:
# A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure in pandas, with labeled axes (rows and columns), useful for data manipulation and analysis.
#
# data:
# A variable containing structured information (e.g., dictionary) that is being converted into a DataFrame format.


In [None]:
print(df)

                         text sentiment
0          I love this movie!  negative
1  This was a terrible movie.  positive
2  I really enjoyed the film.  negative
3      Worst experience ever.  positive
4           It was fantastic!  negative
5         Not worth the time.  positive
6         Absolutely amazing!  negative
7     It was okay, not great.   neutral
8           I hate this film.  negative
9            Best movie ever!  positive


In [None]:
X = df['text']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Vectorize the text
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# X:
# A variable representing the input features (text data) for the model, extracted from the DataFrame.
#
# y:
# A variable representing the target labels (sentiment) for the model, extracted from the DataFrame.
#
# X_train, X_test:
# Variables that store the training and testing subsets of the input features, respectively.
#
# y_train, y_test:
# Variables that store the training and testing subsets of the target labels, respectively.
#
# train_test_split:
# A function from sklearn that splits arrays or matrices into random train and test subsets, facilitating model evaluation.
#
# test_size:
# A parameter that specifies the proportion of the dataset to include in the test split (0.2 means 20%).
#
# random_state:
# A parameter that controls the shuffling applied to the data before applying the split, ensuring reproducibility.
#
# vectorizer:
# A variable that stores an instance of CountVectorizer, which is used to convert a collection of text documents into a matrix of token counts.
#
# CountVectorizer:
# A class from sklearn that converts a collection of text documents into a matrix of token counts, enabling numerical representation of text data.
#
# fit_transform:
# A method that fits the vectorizer to the input data and transforms it into a numerical format in one step.
#
# transform:
# A method that transforms new data into the same numerical format as the training data without fitting the vectorizer again.
#
# X_train_vectorized:
# A variable that stores the vectorized representation of the training input features.
#
# X_test_vectorized:
# A variable that stores the vectorized representation of the testing input features.
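
The distinction between `fit_transform` and `transform` can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: the function names `toy_tokenize`, `fit_vocabulary` and `to_counts` are hypothetical, and the tokenization rule (lower-case, words of two or more characters) is an approximation of `CountVectorizer`'s default behaviour.

```python
import re
from collections import Counter

# Words of two or more characters; single letters such as "I" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def toy_tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    """'Fit' step: give every distinct training token a column index."""
    vocab = sorted({tok for doc in docs for tok in toy_tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_counts(docs, vocabulary):
    """'Transform' step: one row of counts per document, one column per token."""
    rows = []
    for doc in docs:
        counts = Counter(toy_tokenize(doc))
        # Tokens missing from the vocabulary are silently ignored.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["I love this movie!", "I hate this film."]
test_docs = ["I love this film, love it!"]

vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
# {'film': 0, 'hate': 1, 'love': 2, 'movie': 3, 'this': 4}
print(to_counts(train_docs, vocabulary))
# [[0, 0, 1, 1, 1], [1, 1, 0, 0, 1]]
print(to_counts(test_docs, vocabulary))
# [[1, 0, 2, 0, 1]]
```

The word "it" appears only in the test document, so it has no column and does not show up in the test row. This is why the vocabulary is learned from the training data alone and then reused unchanged for the test data.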



In [None]:
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

# model:
# A variable that stores an instance of the Multinomial Naive Bayes classifier, which will be used to train and make predictions on the data.
#
# MultinomialNB:
# A class from the sklearn.naive_bayes module that implements the Naive Bayes algorithm specifically for multinomially distributed data, commonly used for text classification.
#
# fit:
# A method that trains the model using the provided input features and target labels. For Naive Bayes, training means estimating the class priors and the per-class word probabilities from the training counts.
#
# X_train_vectorized:
# A variable that contains the vectorized representation of the training input features, which the model will use to learn from.
#
# y_train:
# A variable that contains the target labels corresponding to the training input features, used for training the model.
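
As a rough picture of what fitting a multinomial Naive Bayes model involves, the sketch below estimates class priors and smoothed word probabilities by hand. It is a textbook-style illustration under stated assumptions, not scikit-learn's code: `train_nb` and `predict_nb` are hypothetical names, the documents are already tokenized, the four training examples are made up, and `alpha=1.0` is assumed as the additive (Laplace) smoothing value.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # Prior: the fraction of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Likelihood: smoothed relative frequency of each word within a class.
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Made-up training data for illustration.
toy_docs = [["love", "this", "movie"], ["great", "movie"],
            ["hate", "this", "film"], ["terrible", "film"]]
toy_labels = ["positive", "positive", "negative", "negative"]

log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb(["love", "movie"], log_prior, log_likelihood))       # positive
print(predict_nb(["terrible", "acting"], log_prior, log_likelihood))  # negative
```

Smoothing keeps a word that never occurred in one class from forcing that class's probability to zero.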



In [None]:
y_pred = model.predict(X_test_vectorized)

# y_pred:
# A variable that stores the predicted sentiment labels for the test dataset, generated by the model.
#
# predict:
# A method that takes new input features and returns the predicted target labels based on the model's learned parameters.
#
# model:
# Refers to the trained Multinomial Naive Bayes classifier that has been fitted to the training data.
#
# X_test_vectorized:
# A variable that contains the vectorized representation of the testing input features, which are used by the model to generate predictions.


In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred)
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(confusion_matrix)

# accuracy:
# A variable that stores the proportion of correct predictions made by the model compared to the total number of predictions, expressed as a decimal.
#
# metrics:
# A module in sklearn that provides various functions for evaluating the performance of machine learning models.
#
# accuracy_score:
# A function from the metrics module that calculates the accuracy of the model by comparing the actual and predicted labels.
#
# confusion_matrix:
# A variable that holds a matrix that summarizes the number of correct and incorrect predictions, categorized by class.
#
# confusion_matrix:
# A function from the metrics module that computes the confusion matrix to evaluate the accuracy of a classification.
#
# print:
# A built-in Python function that outputs data to the console.
#
# f-string:
# A string formatting method in Python that allows embedding expressions inside string literals, prefixed by `f` or `F`, using curly braces `{}`.
#
# y_test:
# A variable containing the true sentiment labels for the test dataset, used for comparison with predictions.
#
# y_pred:
# A variable containing the predicted sentiment labels generated by the model for the test dataset.


Accuracy: 0.00
Confusion Matrix:
[[0 2]
 [0 0]]
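
To read a confusion matrix like the one above, it can help to compute one by hand. The sketch below is a plain-Python illustration with made-up labels. The helper names `toy_accuracy` and `toy_confusion_matrix` are hypothetical. It follows the same layout convention as scikit-learn: rows are true classes, columns are predicted classes, and classes are sorted alphabetically.

```python
def toy_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def toy_confusion_matrix(y_true, y_pred):
    """Rows are true classes, columns are predicted classes."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

# Made-up labels: two test samples, both truly negative, both predicted positive.
true_labels = ["negative", "negative"]
predicted_labels = ["positive", "positive"]

print(f"Accuracy: {toy_accuracy(true_labels, predicted_labels):.2f}")
# Accuracy: 0.00
labels, matrix = toy_confusion_matrix(true_labels, predicted_labels)
print(labels)  # ['negative', 'positive']
print(matrix)  # [[0, 2], [0, 0]]
```

The 2 in the top-right position counts samples whose true class is the first label but which were predicted as the second. Correct predictions would appear on the diagonal.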


In [None]:
def predict_sentiment(text):
    text_vectorized = vectorizer.transform([text])
    prediction = model.predict(text_vectorized)
    return prediction[0]

# Example usage
new_text = "I loved the plot and the acting!"
print(f'Sentiment: {predict_sentiment(new_text)}')

# predict_sentiment:
# A function defined to predict the sentiment of a given text input based on the trained model.
#
# text:
# A parameter representing the input text for which sentiment needs to be predicted.
#
# text_vectorized:
# A variable that holds the vectorized representation of the input text, prepared for input to the model.
#
# vectorizer:
# The CountVectorizer instance that was fitted on the training data, used to convert text data into numerical format.
#
# transform:
# A method that converts new text data into the same vectorized format used for training, allowing the model to process it.
#
# prediction:
# A variable that stores the output from the model, which indicates the predicted sentiment for the input text.
#
# model:
# Refers to the trained Multinomial Naive Bayes classifier used to make predictions.
#
# return:
# A statement that exits the function and provides the specified value back to the caller.
#
# new_text:
# A variable that holds a sample text for testing the sentiment prediction function.
#
# print:
# A built-in Python function that outputs the specified message or variable to the console.
#
# f-string:
# A string formatting method that allows embedding expressions within string literals, making it easier to create formatted output.


Sentiment: negative


In [None]:
print("Hello World")

# print:
# A built-in Python function used to display output to the console or standard output device.
#
# "Hello World":
# A string literal that contains the text "Hello World", commonly used as a basic example in programming tutorials.
#
# " ":
# Quotation marks used to define the beginning and end of a string in Python.


This combination allows for the production of beautiful documents containing software, documentation and discussion. For larger programs you may wish to use Python in a stand-alone environment such as a traditional IDE. But for demonstration purposes Jupyter is a very useful tool.

Notebook files have the ".ipynb" extension. A Jupyter notebook is one of many environments in which you may run Python code. Colab and the Jupyter notebook editor in Anaconda are two of the many pieces of software you may use to write and run a Jupyter notebook. For this course we recommend using the online Google Colab tool, but you can use Anaconda to run the notebooks on your own machine without an internet connection. On college computers, Jupyter can be used by launching Anaconda from the Software Hub Apps Anywhere interface.

Note that exact interfaces will differ between different environments but the same functionality should be found in most environments. This course will be using the Colab environment.

## Cells and Executing Code

A notebook is made up of one or more "cells". Cells can contain either the html-like text used to generate formatted text, or code to be run by the user. A cell containing a piece of code may be recognised by the ```[ ]``` to the left of it. Code in these cells can be run in a number of ways. The simplest is to click on the ```[ ]```. This will execute the code. Try this with the code snippet below:

In [None]:
print("Yes, it worked!")

You should have seen the message "Yes, it worked!" appear immediately beneath the code. This is the output of the code, which has been printed to the screen. You may also have noticed a number appear between the square brackets to the left of the code snippet. This indicates the order in which the code snippet has been executed. Code cells may be executed in any order and variables will be saved between executions of code snippets. To try this, execute the three code snippets below in the following order:
- 1
- 2
- 3
- 2

In [None]:
a="Message 1"

In [None]:
print(a)

In [None]:
a="Message 2"

The first time you ran code snippet 2 you should have seen "Message 1" as the output and the second time the output should have been "Message 2". This is because the first time it was run, the value assigned to the variable named "a" was "Message 1", as set by the first code snippet, and the second time it was "Message 2", as set by the third code snippet. Note also the current numbers contained within the square brackets. These help you to know which cells have been executed and in which order.
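
The same behaviour can be mimicked in a single block of ordinary Python. In this sketch each function stands in for one notebook cell and a dictionary plays the role of the shared state that persists between cell executions. The names `namespace`, `cell_1`, `cell_2`, `cell_3` and `outputs` are invented for the illustration and are not part of Jupyter.

```python
# Shared state, standing in for the variables the notebook remembers.
namespace = {}

def cell_1():
    namespace["a"] = "Message 1"

def cell_2():
    # Reads whatever value "a" holds at the moment the cell is run.
    return namespace["a"]

def cell_3():
    namespace["a"] = "Message 2"

outputs = []
# Run the cells in the order 1, 2, 3, 2.
for cell in (cell_1, cell_2, cell_3, cell_2):
    result = cell()
    if result is not None:
        outputs.append(result)

print(outputs)
# ['Message 1', 'Message 2']
```

The second cell gives a different result each time because its output depends on which cells were run before it, not on where it sits in the notebook.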

## Sharing Jupyter Notebooks on Colab
When a Jupyter Notebook is shared with you on Colab, you will often receive access to the notebook which will allow you to run code, but not edit it. This should be the case for the notebooks that form part of this course. In this case you can select "Save a Copy in Drive" from the "File" menu to create a new copy that is yours and that you can edit.

For this course, it is recommended that you create two copies. One of these should be the original copy without your edits, and the other a copy which you can edit to complete exercises or experiment.

## Basic Jupyter Commands

Jupyter contains a number of useful tools for executing these cells. By using the "Runtime" menu, you can run multiple cells at a time using "Run all", "Run before", "Run selected" and "Run after".

You can clear output (this is the term for what is written under a code cell when it's executed) by clicking on the symbol to the left of it. You can clear all outputs from the notebook using the "Clear All Outputs" command on the "Edit" menu. Clearing the output will not unset variables set by the code snippets run, only remove the output printed to the screen.

To unset variables, use the "Restart Runtime" or "Reset Runtime" option in the "Runtime" menu. The "Interrupt Execution" command on the same menu (the "Kernel" menu in other Jupyter environments) will halt the processing of code, which can be useful if you've accidentally written a piece of code that will never finish executing or if the code is taking too long to execute.

The "insert" menu allows you to create new cells. The "cell type" option in the "cell" menu allows you to toggle the current cell between the different cell types available:
- **Code**: Code snippets
- **Text**: The html-like language used to generate text, tables, equations, etc.

Alternatively, you can hover your mouse in the space after a cell and add a code or text cell there.

### Exercise

Try each of these commands from the different menus for yourself on this notebook and ensure they behave as you would expect.

## Text Cells in Jupyter
You can include all sorts of information in Jupyter text cells to obtain different effects. To see how each of the following examples is generated, double click on this cell. To return to the formatted text, run the cell.

### Headings
Headings can be generated using the hash symbol "#". The more of these there are, the smaller the heading. The sub-sub-heading above is an example.

### Tables
Tables can be created in a way similar to basic html, using a combination of the "|" and "-" symbols:

| This | is    |
|------|-------|
|   a  |  table|
| It's | fancy |

### Equations
Equations can be written in a way similar to LaTeX by surrounding the text with "\$" symbols:

$a=\frac{\int\limits_{0}^{\pi} \sin{(bx)} \textrm{d}x}{4}$

Don't worry if you don't understand the exact syntax used to generate this example. When you write your own equation in the exercise, try something very simple instead. If it looks like a simple algebraic expression, it will probably render as you intend.

### Code Snippets
You can write snippets of code in a text cell and they will be highlighted as if they were code written in a code cell. This can be useful for demonstrating a code feature in a textual way. For example:

```python
print("Hello World")
```

There is no way to run this code; it is merely normal text highlighted to look like code. The "python" which precedes the code itself tells Jupyter which language the snippet is written in so that it can be highlighted accordingly.

In some environments, text cells may also be referred to as "markdown" cells.

### Exercise
Try creating simple versions of each of the constructs above in a new text cell below this one.