# Using Jupyter Notebooks



In [None]:
!pip install nltk spacy gensim pandas scikit-learn
#pip: A package-management system used to install and manage software packages written in Python.
#install: A command used to install the specified Python libraries or modules.
#nltk: Stands for Natural Language Toolkit, a library in Python used for working with human language data (text) and performing text processing tasks like tokenization, parsing, classification, etc.
#spacy: A library for advanced Natural Language Processing (NLP) in Python, used for tasks like tokenization, part-of-speech tagging, named entity recognition, and more.
#gensim: A Python library used for topic modeling and document similarity analysis, especially popular for tasks related to natural language processing and working with large text collections.
#pandas: A Python library used for data manipulation and analysis, providing data structures like DataFrame, and functions for working with structured data.
#scikit-learn: A Python library used for machine learning and statistical modeling, including tools for classification, regression, clustering, and dimensionality reduction.



## Introduction
Jupyter Notebook is an open-source, web-based Python editor that runs in your browser. It allows a combination of text written in an html-like format known as "markdown", such as the block of text you're reading right now, and inline code, tools and outputs such as this one:

In [None]:
import nltk #import: A Python keyword used to bring in external modules or libraries into your script so you can use their functions and classes.
nltk.download('punkt') # punkt: A pre-trained tokenizer that is part of the NLTK library, used for splitting a text into sentences or words (tokenization). It is specifically designed to handle punctuation.
from nltk.tokenize import word_tokenize # nltk.tokenize: A sub-module in the NLTK library used for breaking down text into smaller parts (tokens) like words or sentences.
                                        # word_tokenize: A function from the nltk.tokenize module that splits a sentence into individual words (tokens).
sentence = "Hello, world! This is NLP."
tokens = word_tokenize(sentence) # this will tokenize the given sentence in to tokens
print(tokens) # Output: ['Hello', ',', 'world', '!', 'This', 'is','NLP', '.'] #print: A built-in Python function that outputs the specified value to the console.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']
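
To get a feel for what tokenization produces, here is a rough sketch that uses only Python's built-in `re` module. It is not how `word_tokenize` works internally (NLTK's tokenizer also deals with contractions, abbreviations and many other cases); the pattern and the helper name `simple_tokenize` are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters is one token,
    # and each punctuation mark becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Hello, world! This is NLP."
tokens_sketch = simple_tokenize(sentence)
print(tokens_sketch)
```

For this particular sentence the sketch gives the same tokens as the output above, but it should not be expected to agree with NLTK on harder text.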


In [None]:
import nltk
nltk.download('punkt') # Download the Punkt tokenizer
from nltk.tokenize import word_tokenize
#In summary, the snippet prepares for text tokenization by downloading a tokenizer and importing the function required to split sentences into words.

In [None]:
from nltk.corpus import stopwords #nltk.corpus: A sub-module in the NLTK library that provides access to a variety of linguistic resources, such as corpora, lexical resources, and word lists.
                                  #stopwords: A collection of common words (like "the", "is", "in", etc.) that are often removed from text during preprocessing, as they don't carry significant meaning in natural language processing tasks.
nltk.download('stopwords') #nltk.download('stopwords'): This downloads the NLTK stopwords corpus, a pre-compiled list of commonly used stopwords in different languages.
stop_words = set(stopwords.words('english')) #set: A Python data structure that stores unique elements. In this case, it's used to hold the stop words, ensuring that only unique stop words are stored.
                                             #stopwords.words('english'): A method that retrieves the list of English stopwords from the NLTK corpus.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words] #filtered_tokens: A variable that stores the result of filtering the tokens, removing words that are present in the stop words list.
                                                                              #word.lower(): Converts the word to lowercase to ensure case-insensitive matching when filtering out stopwords.
                                                                              #not in stop_words: This condition checks whether each word in the list is not present in the stop words set.
print(filtered_tokens)


['Hello', ',', 'world', '!', 'NLP', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
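
The filtered list above still contains punctuation, because punctuation marks are not stop words. The sketch below shows one possible way to drop them as well. To stay self-contained it uses a tiny hand-written stop word set (`small_stop_words`, an assumption for illustration, not the NLTK list) and `str.isalpha()` to keep only alphabetic tokens.

```python
# Tokens copied from the tokenization output shown earlier
tokens_example = ['Hello', ',', 'world', '!', 'This', 'is', 'NLP', '.']

# Hypothetical, deliberately tiny stop word set
small_stop_words = {'this', 'is', 'a', 'the', 'in'}

# Keep a token only if it is purely alphabetic and not a stop word
words_only = [
    token for token in tokens_example
    if token.isalpha() and token.lower() not in small_stop_words
]
print(words_only)
```

Note that `isalpha()` also discards tokens containing digits or hyphens, which may or may not be what you want for a given task.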


In [None]:
from nltk.stem import PorterStemmer #nltk.stem: A sub-module of the NLTK library that provides methods for stemming and lemmatization, which are text normalization techniques.
from nltk.stem import WordNetLemmatizer # WordNetLemmatizer: A lemmatizer from the nltk.stem module that reduces words to their base or dictionary form, called the lemma. It looks words up in the WordNet lexical database and accepts an optional part-of-speech argument.
nltk.download('wordnet') # WordNet is a large lexical database of English words grouped into sets of synonyms (synsets).
ps = PorterStemmer() # PorterStemmer: A stemming algorithm available in the nltk.stem module. Stemming reduces words to their root form (stem) by stripping suffixes, even if the result is not a valid word (e.g., "running" to "run" or "studies" to "studi").
lemmatizer = WordNetLemmatizer()
print(ps.stem("faster")) # ps.stem("faster"): This method applies stemming to the word "faster". The Porter algorithm leaves this word unchanged, so the output is "faster".
print(lemmatizer.lemmatize("faster")) #lemmatizer.lemmatize("faster"): This method applies lemmatization to the word "faster". With no part-of-speech argument the word is treated as a noun, so it is returned unchanged as "faster". Lemmatization needs the part of speech to produce the correct lemma (e.g., passing pos="a" for adjective should return "fast").


faster
faster


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
import pandas as pd #Pandas is used for data manipulation and analysis, particularly with tabular data like DataFrames.
import nltk
from sklearn.model_selection import train_test_split # sklearn.model_selection: A sub-module in scikit-learn that contains utilities for splitting datasets into training and testing sets, cross-validation, and more.
                                                     # train_test_split: A function in sklearn.model_selection that splits data into training and testing sets for model training and evaluation.
from sklearn.feature_extraction.text import CountVectorizer #sklearn.feature_extraction.text: A sub-module in scikit-learn that contains utilities for extracting features from text data to convert it into numeric format suitable for machine learning algorithms.
                                                            #CountVectorizer: A class in sklearn.feature_extraction.text that converts a collection of text documents into a matrix of token counts, representing the frequency of each word (or token) in the text corpus.
from sklearn.naive_bayes import MultinomialNB # sklearn.naive_bayes: A sub-module in scikit-learn that provides implementations of the Naive Bayes algorithm for classification tasks.
                                              # MultinomialNB: A specific implementation of the Naive Bayes algorithm used for classification, particularly effective when features are multinomially distributed, such as word frequencies.
from sklearn import metrics #sklearn: A popular Python library (also known as scikit-learn) that provides simple and efficient tools for data mining and data analysis, especially machine learning tasks.
                            #metrics: A sub-module of sklearn that contains various methods to evaluate the performance of machine learning models, including accuracy, precision, recall, F1 score, etc.

#In summary, this code imports essential libraries for performing machine learning tasks,particularly for text data processing, splitting the dataset, building a classification model using Naive Bayes, and evaluating the model's performance.

In [None]:
data = {
 'text': [
 'I love this movie!',
 'This was a terrible movie.',
 'I really enjoyed the film.',
 'Worst experience ever.',
 'It was fantastic!',
 'Not worth the time.',
 'Absolutely amazing!',
 'It was okay, not great.',
 'I hate this film.',
 'Best movie ever!'
 ],
 'sentiment': [
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'positive',
 'negative',
 'neutral',
 'negative',
 'positive'
 ]
}

In [None]:
df = pd.DataFrame(data) # DataFrame: A constructor in the pandas library that creates a DataFrame object, which is essentially a table with rows and columns, used for data manipulation and analysis.

In [None]:
print(df)

                         text sentiment
0          I love this movie!  negative
1  This was a terrible movie.  positive
2  I really enjoyed the film.  negative
3      Worst experience ever.  positive
4           It was fantastic!  negative
5         Not worth the time.  positive
6         Absolutely amazing!  negative
7     It was okay, not great.   neutral
8           I hate this film.  negative
9            Best movie ever!  positive


In [None]:
X = df['text']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)
#train_test_split: A function from sklearn.model_selection that splits the data into training and testing sets. Here, X and y are split into training data (X_train, y_train) and testing data (X_test, y_test).
#test_size=0.2: This argument specifies the proportion of the dataset to be used as the test set (20% in this case).
#random_state=42: A parameter that sets a seed for random number generation to ensure reproducibility of the split. Using the same seed will yield the same split each time the code is run.

# Vectorize the text
vectorizer = CountVectorizer() #CountVectorizer(): This initializes a CountVectorizer object from sklearn.feature_extraction.text, which will convert the text data into a matrix of token counts (bag-of-words representation).
X_train_vectorized = vectorizer.fit_transform(X_train) #fit_transform(X_train): This method fits the CountVectorizer on the training data (X_train), learning the vocabulary from the text, and transforms the training text into a matrix of token counts. The result (X_train_vectorized) is a numerical representation of the text data that can be used for machine learning.
X_test_vectorized = vectorizer.transform(X_test) # transform(X_test): This method applies the same vectorization learned from the training data to the test data (X_test). It converts the test text into a matrix of token counts using the vocabulary from the training set.

#this code prepares the text data for machine learning by splitting it into training and testing sets, then vectorizing the text using the bag-of-words model (i.e., token counts). This transforms the text data into a numerical format that can be used as input for machine learning algorithms.
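
To make the difference between `fit_transform` and `transform` more concrete, here is a minimal pure-Python sketch of a bag-of-words model. It is a simplified stand-in, not scikit-learn's implementation: the helper names (`tokenize`, `to_counts`) and the two tiny document lists are assumptions for illustration. The tokenizer lowercases the text and keeps runs of two or more word characters, which is intended to mimic the default behaviour of `CountVectorizer`.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical mini corpus
train_docs = ["I love this movie!", "I hate this film."]
test_docs = ["I love this film, really!"]

# "fit": learn a sorted vocabulary from the training documents only
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def to_counts(doc):
    # "transform": count each vocabulary word; unknown words are ignored
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

train_matrix = [to_counts(doc) for doc in train_docs]
test_matrix = [to_counts(doc) for doc in test_docs]

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

The word "really" appears only in the test document, so it has no column in the matrix. This is why the vocabulary is learned from the training data alone and then reused for the test data.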

In [None]:
model = MultinomialNB() # MultinomialNB(): Creates a Multinomial Naive Bayes classifier object. This classifier is typically used for discrete features like word counts (i.e., the bag-of-words representation). It assumes that the features follow a multinomial distribution.
model.fit(X_train_vectorized, y_train) #fit(): A method that trains the model on the provided training data. It adjusts the model's parameters based on the input features and the corresponding target labels. In this case, the model is trained using X_train_vectorized (the vectorized text data) and y_train (the sentiment labels).
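
For readers curious about what training a multinomial Naive Bayes classifier involves, the sketch below works through the idea in plain Python: count how often each word occurs in each class, smooth the counts, and pick the class with the highest log-probability. This is a simplified illustration under stated assumptions, not scikit-learn's code; the toy training set, the helper names and the smoothing value `alpha = 1.0` are all chosen for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical toy training set
train = [
    ("I love this movie", "positive"),
    ("Fantastic film", "positive"),
    ("Terrible movie", "negative"),
    ("I hate this film", "negative"),
]

# "Training" is just counting documents per class and words per class
class_doc_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(tokenize(text))

vocab = {word for counts in word_counts.values() for word in counts}
alpha = 1.0  # additive (Laplace) smoothing, assumed value

def log_score(text, label):
    # log P(label) + sum of log P(word | label) over known words
    score = math.log(class_doc_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for word in tokenize(text):
        if word in vocab:
            numerator = word_counts[label][word] + alpha
            score += math.log(numerator / (total + alpha * len(vocab)))
    return score

def predict_toy(text):
    # Choose the class with the highest log-probability score
    return max(class_doc_counts, key=lambda label: log_score(text, label))

print(predict_toy("I love it"))
print(predict_toy("terrible"))
```

Smoothing keeps a word that never occurred in a class from forcing that class's probability to zero.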

In [None]:
y_pred = model.predict(X_test_vectorized) #model.predict(): A method used to make predictions using the trained model. It takes the input features (in this case, the vectorized test data) and outputs the predicted labels based on the learned parameters from the training phase.

In [None]:
accuracy = metrics.accuracy_score(y_test, y_pred) # Accuracy is the ratio of correctly predicted instances to the total instances in the test set.
                                                  #accuracy_score(): A function from the metrics module that computes the accuracy of the model's predictions. It takes the true labels (y_test) and the predicted labels (y_pred) as inputs.
confusion_matrix = metrics.confusion_matrix(y_test, y_pred) # confusion_matrix(): A function from the metrics module that computes the confusion matrix. It also takes the true labels (y_test) and the predicted labels (y_pred) as inputs.
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(confusion_matrix)


Accuracy: 0.00
Confusion Matrix:
[[0 2]
 [0 0]]
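
To help read a confusion matrix like the one above, here is a small pure-Python sketch of how accuracy and a confusion matrix can be computed by hand. Rows correspond to the true labels and columns to the predicted labels, with labels in sorted order, which is the convention scikit-learn uses. The function name and the example labels are hypothetical; they show just one combination of labels that would give a matrix of this shape.

```python
def accuracy_and_confusion(y_true, y_pred):
    # Collect every label seen in either list, in sorted order
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}

    # matrix[row][col]: row = true label, col = predicted label
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted]] += 1

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true), labels, matrix

# Hypothetical example: two negative reviews, both predicted positive
acc, labels, matrix = accuracy_and_confusion(
    ["negative", "negative"],
    ["positive", "positive"],
)
print(f"Accuracy: {acc:.2f}")
print(labels)
print(matrix)
```

With only two test examples, a single number like accuracy says very little about the model, so a larger dataset would be needed for a meaningful evaluation.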


In [None]:
def predict_sentiment(text):   #def: A keyword used to define a new function in Python.
                              #predict_sentiment: The name of the function being defined. This function is designed to predict the sentiment of a given text input.
 text_vectorized = vectorizer.transform([text]) #vectorizer.transform([text]): This method converts the input text into a numerical format (a vector of token counts) suitable for prediction. The input text is passed as a list to maintain the expected input shape.
 prediction = model.predict(text_vectorized)
 return prediction[0] #return: A statement that exits the function and sends back the specified value (in this case, the predicted sentiment) to the caller.
                      #prediction[0]: This retrieves the first (and only) predicted sentiment label from the prediction array. The model returns an array of predictions, but since the input is a single text, we take the first element.
# Example usage
new_text = "I loved the plot and the acting!"
print(f'Sentiment: {predict_sentiment(new_text)}')

#this code defines a function to predict the sentiment of a given text input. It vectorizes the input text, makes a prediction using the trained model, and returns the predicted sentiment. An example usage is also provided to demonstrate how to call the function and print the result.

Sentiment: negative


In [None]:
print("Hello World")

This combination allows for the production of beautiful documents containing software, documentation and discussion. For larger programs you may wish to use Python in a stand-alone environment such as a traditional IDE. But for demonstration purposes Jupyter is a very useful tool.

Notebook files have the ".ipynb" extension. A Jupyter notebook is one of many environments in which you may run Python code. Colab and the Jupyter notebook editor in Anaconda are two of the many pieces of software you may use to write and run a Jupyter notebook. For this course we recommend using the online Google Colab tool, but you can use Anaconda to run the notebooks on your own machine without an internet connection. On college computers, Jupyter can be used by launching Anaconda from the Software Hub Apps Anywhere interface.

Note that exact interfaces will differ between different environments but the same functionality should be found in most environments. This course will be using the Colab environment.

## Cells and Executing Code

A notebook is made up of one or more "cells". Cells can contain either the html-like text used to generate formatted text, or code to be run by the user. A cell containing a piece of code may be recognised by the ```[ ]``` to the left of it. Code in these blocks can be run in a number of ways. The simplest is to click on the ```[ ]```. This will execute the code. Try this with the code snippet below:

In [None]:
print("Yes, it worked!")

You should have seen the message "Yes, it worked!" appear immediately beneath the code. This is the output of the code, which has been printed to the screen. You may also have noticed a number appear between the square brackets to the left of the code snippet. This indicates the order in which the code snippet has been executed. Code cells may be executed in any order and variables will be saved between executions of code snippets. To try this, execute the three code snippets below in the following order:
- 1
- 2
- 3
- 2

In [None]:
a="Message 1"

In [None]:
print(a)

In [None]:
a="Message 2"

The first time you ran code snippet 2 you should have seen "Message 1" as the output and the second time the output should have been "Message 2". This is because the first time it was run, the value assigned to the variable named "a" was "Message 1" as set by the first code snippet and the second time it was "Message 2" as set by the third code snippet. Note also the current numbers contained within square brackets. These help you to know which cells have been executed and in which order.
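
The behaviour described here can be imitated in plain Python. The sketch below is only an analogy for how a notebook runtime keeps state, not how Jupyter is implemented: each "cell" is a string of code, and all cells are executed against one shared dictionary of variables (`namespace`, a name chosen for illustration).

```python
import contextlib
import io

# The three code snippets, keyed by their position in the notebook
cells = {
    1: 'a = "Message 1"',
    2: 'print(a)',
    3: 'a = "Message 2"',
}

namespace = {}  # shared by every cell, like the notebook's runtime
outputs = []

# Run the cells in the order suggested: 1, 2, 3, 2
for number in [1, 2, 3, 2]:
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(cells[number], namespace)
    outputs.append(buffer.getvalue().strip())

print(outputs)
```

The same cell (number 2) produces different output on its two runs because the shared state changed in between.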

## Sharing Jupyter Notebooks on Colab
When a Jupyter Notebook is shared with you on Colab, you will often receive access which will allow you to run code, but not edit it. This should be the case for the notebooks that form part of this course. In this case you can select "Save a Copy in Drive" from the "File" menu to create a new copy that is yours and that you can edit.

For this course, it is recommended that you keep two copies: the original copy without your edits, and another which you can edit to complete exercises or experiment.

## Basic Jupyter Commands

Jupyter contains a number of useful tools for executing these cells. By using the "Runtime" menu, you can run multiple cells at a time using "Run all", "Run before", "Run selected" and "Run after".

You can clear output (this is the term for what is written under a code cell when it's executed) by clicking on the symbol to the left of it. You can clear all outputs from the notebook using the "Clear All Outputs" command on the "Edit" menu. Clearing the output will not unset variables set by the code snippets run, only remove the output printed to the screen.

To unset variables, use the "Restart Runtime" or "Reset Runtime" option in the Runtime menu. The "Interrupt Execution" command on the Runtime menu will halt the processing of code, which can be useful if you've accidentally written a piece of code that will never finish executing or if the code is taking too long to execute.
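
If you only need to unset a single variable rather than restart the whole runtime, Python's `del` statement removes one name. A small sketch, with variable names chosen for illustration only:

```python
a = "Message 1"
del a  # remove the name "a"; everything else in the runtime is untouched

try:
    print(a)
    still_defined = True
except NameError:
    # Using a deleted (or never assigned) variable raises NameError
    still_defined = False

print(still_defined)
```

A `NameError` like this is also what you would see after restarting the runtime and running a cell that uses a variable before the cell that sets it.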

The "insert" menu allows you to create new cells. The "cell type" option in the "cell" menu allows you to toggle the current cell type between the different cell types available:
- **Code**: Code snippets
- **Text**: The html-like language used to generate text, tables, equations, etc.

Alternatively, you can hover your mouse in the space after a cell and add a code or text cell there.

### Exercise

Try each of these commands from the different menus for yourself on this notebook and ensure they behave as you would expect.

## Text Cells in Jupyter
You can include all sorts of information in Jupyter text cells to obtain different effects. To see how each of the following examples is generated, double click on this cell. To return to the formatted text, run the cell.

### Headings
Headings can be generated using the hash symbol "#". The more of these there are, the smaller the heading. The sub-sub-heading above is an example.

### Tables
Tables can be created in a way similar to basic html, using a combination of the "|" and "-" symbols:

| This | is    |
|------|-------|
|   a  |  table|
| It's | fancy |

### Equations
Equations can be written in a way similar to LaTeX by surrounding the text with "\$" symbols:

$a=\frac{\int\limits_{0}^{\pi} \sin{(bx)} \textrm{d}x}{4}$

Don't worry if you don't understand the exact syntax used to generate this example. When you try this in the exercise, write something very simple instead. If it looks like a simple algebraic expression, it will probably render how you intend.

### Code Snippets
You can write snippets of code in a text cell and they will be highlighted as if they were code written in a code cell. This can be useful for demonstrating a code feature in a textual way. For example:

```python
print ("Hello World")
```

There is no way to run this code; it is merely normal text highlighted to look like code. The "python" which precedes the code itself tells Jupyter which language the code snippet is written in so it can be highlighted accordingly.

In some environments, text cells may also be referred to as "markdown" cells.

### Exercise
Try creating simple versions of each of the constructs above in a new text cell below this one.