# This is a sample Jupyter Notebook


In [2]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## Data reading

---
The pandas function ```read_csv``` loads the news dataset into the ```df``` DataFrame for easy access.

- ```.shape```:  
This DataFrame attribute returns a tuple with the dimensions of the DataFrame:
    - The first element is the number of rows (observations).
    - The second element is the number of columns (features).

In [3]:
#Read the data
df = pd.read_csv('data/news.csv')

#Get shape and head (in a notebook, only the last expression, df.head(), is displayed)
df.shape
df.head()

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|--------------|------------|-------|------|-------|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


## Label extraction

---
```labels = df.label```:  
This line extracts the 'label' column from the DataFrame ```df``` and stores it in the variable ```labels```.
The result is a Pandas Series holding one label per row of ```df```.

&nbsp;

### Displaying the First Few Entries:

---
```labels.head()```:  
This method is called on the ```labels``` Series and returns its first five entries by default.
It is a quick way to check that the column was extracted correctly. It does not show the distribution of labels, for which ```labels.value_counts()``` is the more informative call.

In [4]:
#DataFlair - Get the labels
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

## Creating training and testing subsets from news.csv

---
The ```train_test_split``` function is part of the ```sklearn.model_selection``` module in the Scikit-learn library.
It splits arrays or matrices into random train and test subsets.

&nbsp;

### Parameters:

---
1. ```df['text']```:  
This is the feature set (input data) to split: the 'text' column of ```df```, which contains the article text used as model input.
2. ```labels```:  
This is the target variable (output data) to predict.
It must be array-like and hold one label for each row of the text data.
3. ```test_size=0.2```:  
This parameter specifies the proportion of the dataset to include in the test split.
A value of 0.2 means that 20% of the data is used for testing and 80% for training.
4. ```random_state=7```:  
This parameter seeds the shuffling applied to the data before splitting.
Setting a specific integer (like 7) ensures reproducibility: every run of the code produces the same split of the dataset.

&nbsp;

### Output Variables:

---
1. ```x_train```:  
This variable will contain the training set of features (text data).
2. ```x_test```:  
This variable will contain the test set of features (text data).
3. ```y_train```:  
This variable will contain the training set of labels (target variable).
4. ```y_test```:  
This variable will contain the test set of labels (target variable).
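
To make the roles of ```test_size``` and the seed concrete, here is a from-scratch sketch of a seeded shuffle-and-split. It is a simplified stand-in for illustration, not scikit-learn's implementation: the helper ```simple_train_test_split``` and the toy data are hypothetical, and the exact rows it selects will differ from what ```train_test_split``` picks for the same seed.

```python
import random

def simple_train_test_split(features, targets, test_size=0.2, seed=7):
    """Shuffle row indices with a fixed seed, then take the first share as the test set."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # same seed -> same shuffle
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_tr = [features[i] for i in train_idx]
    x_te = [features[i] for i in test_idx]
    y_tr = [targets[i] for i in train_idx]
    y_te = [targets[i] for i in test_idx]
    return x_tr, x_te, y_tr, y_te

# Toy data (hypothetical): ten "articles" with alternating labels
toy_texts = [f"article {i}" for i in range(10)]
toy_labels = ["FAKE" if i % 2 == 0 else "REAL" for i in range(10)]

first_run = simple_train_test_split(toy_texts, toy_labels)
second_run = simple_train_test_split(toy_texts, toy_labels)

print(len(first_run[0]), len(first_run[1]))  # 8 2
print(first_run == second_run)               # True
```

Running the split twice with the same seed gives identical subsets, which is the property ```random_state=7``` provides in the notebook.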

In [5]:
#DataFlair - Split the dataset
x_train, x_test, y_train, y_test = train_test_split(df['text'], labels, test_size = 0.2, random_state = 7)

## TfidfVectorizer

---
The ```TfidfVectorizer``` is a tool used in natural language processing (NLP) to convert a collection of raw documents into a matrix of TF-IDF features.   
**TF-IDF** stands for Term Frequency-Inverse Document Frequency, which is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus).

&nbsp;
In particular:
- **TF**: Term Frequency measures how often a term appears in a document. It reflects the importance of a term within that specific document.  

    &nbsp;
    
    Interpretation:  
    - A higher TF indicates that the term is more relevant to the document's content.  
    - However, TF alone does not account for the term's importance across the entire corpus, which can lead to common words being overrepresented.
- **IDF**: Inverse Document Frequency measures how much information a term provides, i.e., how common or rare it is across all documents in the corpus. It reduces the weight of common terms that appear in many documents.  

    &nbsp;
    
    Interpretation:  
    - A higher IDF score indicates that the term is rare across the corpus and thus more informative.
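    - If a term appears in every document, its IDF takes its minimum value: zero under the textbook definition ```log(N / df)```, and 1 under the smoothed variant that scikit-learn uses by default. Either way, the term is of little use for distinguishing between documents.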
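
For reference, the weighting that scikit-learn documents for its default settings (```smooth_idf=True```, ```norm='l2'```) can be written as below. Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$. Textbooks and other libraries use slightly different variants, so treat this as one common formulation rather than the only definition.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
\text{tfidf}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

The final division rescales each document's vector to unit length, so long and short documents become comparable.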

&nbsp;

### Parameters

---
1. ```stop_words='english'```:  
This parameter specifies that common English stop words (e.g., "the", "is", "in") should be ignored during the vectorization process.
Removing stop words helps to focus on the more meaningful words in the text, which can improve the performance of machine learning models.
2. ```max_df=0.7```:  
This parameter sets a threshold for filtering out terms that appear in a high proportion of documents.
A value of 0.7 means that any term that appears in more than 70% of the documents will be ignored.
This helps to remove common terms that are not useful for distinguishing between documents.  
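
The effect of the two parameters can be sketched in plain Python. This is an illustration only: the helper ```filter_vocabulary```, the four toy headlines, and the two-word stop list are hypothetical, and whitespace tokenization is a simplification of what ```TfidfVectorizer``` does.

```python
def filter_vocabulary(documents, stop_words, max_df=0.7):
    """Return the sorted terms that survive stop-word removal and the max_df cut."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized) - set(stop_words)
    kept = []
    for term in sorted(vocabulary):
        # Document frequency: in how many documents does the term occur?
        doc_freq = sum(term in tokens for tokens in tokenized)
        if doc_freq / n_docs <= max_df:
            kept.append(term)
    return kept

toy_docs = [
    "the senate passed the budget",
    "the senate rejected the treaty",
    "the senate debated the budget",
    "markets rallied after the vote",
]
tiny_stop_words = {"the", "after"}  # hypothetical stand-in for the built-in English list

kept_terms = filter_vocabulary(toy_docs, tiny_stop_words, max_df=0.7)
print(kept_terms)
```

Here "the" and "after" are removed as stop words, and "senate" is dropped because it occurs in 3 of 4 documents (75%), which exceeds the 70% threshold. "budget" occurs in 2 of 4 documents and is kept.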

&nbsp;

### Example of a TF-IDF Matrix

---
Let's consider the following three documents:

1. **Document 1**: "This is a sample document."
2. **Document 2**: "This document is another example."
3. **Document 3**: "And this is yet another document."

&nbsp;

With scikit-learn's default ```TfidfVectorizer``` settings (no stop-word removal, smoothed IDF, rows normalized to unit length), the TF-IDF matrix for the above documents looks as follows, rounded to three decimals. The single-character token "a" is dropped by the default tokenizer, so it has no column:

| _Terms_:  | and   | another | document | example | is    | sample | this  | yet   |
|-----------|-------|---------|----------|---------|-------|--------|-------|-------|
| **Doc 1** | 0.000 | 0.000   | 0.413    | 0.000   | 0.413 | 0.699  | 0.413 | 0.000 |
| **Doc 2** | 0.000 | 0.469   | 0.365    | 0.617   | 0.365 | 0.000  | 0.365 | 0.000 |
| **Doc 3** | 0.525 | 0.399   | 0.310    | 0.000   | 0.310 | 0.000  | 0.310 | 0.525 |

### Explanation of the Matrix
- **Rows**: Each row corresponds to a document.
- **Columns**: Each column corresponds to a unique term extracted from the documents.
- **Values**: The values in the matrix represent the TF-IDF score for each term in each document. A higher score indicates that the term is more important in that document.

### Interpretation
- **Document 1**: The term "sample" has the highest score (0.699) because it appears only in this document. The terms "document", "is", and "this" occur in all three documents and therefore score lower (0.413).
- **Document 2**: The term "example" (0.617) is unique to this document and scores highest, followed by "another" (0.469), which also appears in Document 3.
- **Document 3**: The terms "and" and "yet" (0.525 each) are unique to this document and score highest.
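
Scores for the three example documents can be computed from scratch with the standard library. The sketch below assumes smoothed IDF, ```ln((1 + n) / (1 + df)) + 1```, L2-normalized rows, and tokens of at least two word characters. These choices are meant to mirror scikit-learn's documented defaults without stop-word removal, but the code is an independent illustration, not the library's implementation.

```python
import math
import re
from collections import Counter

documents = [
    "This is a sample document.",
    "This document is another example.",
    "And this is yet another document.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters ("a" is dropped)
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(documents)

# Document frequency: number of documents in which each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that occur in every document ("document", "is", "this") get the minimum IDF of 1, so within each row they score lower than terms unique to that document.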

In [6]:
#DataFlair - Initialize a TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

#DataFlair - Fit and transform train set, transform test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train) 
tfidf_test = tfidf_vectorizer.transform(x_test)

## PassiveAggressiveClassifier

---
The Passive Aggressive Classifier is a type of linear classifier that is particularly well-suited for large-scale learning tasks, especially in text classification problems. It is part of the Scikit-learn library and is designed to handle online learning, where the model is updated incrementally as new data arrives.

&nbsp;
### Key Characteristics

---
1. Online Learning:  
   The Passive Aggressive Classifier is designed for online learning, meaning it can update its model with each new training example rather than requiring the entire dataset to be retrained.
2. Passive and Aggressive Updates:  
   The name "Passive Aggressive" refers to the way the algorithm updates its weights:
    - Passive Update: When an example is classified correctly with a sufficient margin, the model does not change its weights.
    - Aggressive Update: When an example is misclassified or falls inside the margin, the model adjusts its weights by as much as needed to correct that example, up to a limit set by the ```C``` parameter.


### How It Works

---
1. **Loss Function**:  
The Passive Aggressive Classifier uses a hinge loss function, which is common in support vector machines (SVM). The hinge loss penalizes misclassifications and encourages the model to make correct predictions with a margin.

2. **Weight Update Rule**:  
If the hinge loss on the current example is zero, no update is made. Otherwise, the weights are moved toward the correct side of the decision boundary by a step whose size grows with the loss.

3. **Regularization**:  
    The Passive Aggressive Classifier includes a regularization parameter, ```C```, that caps the size of each update and so controls the trade-off between fitting the training data and keeping the model simple.  
    This parameter is optional and inversely controls the strength of regularization: a smaller ```C``` applies stronger regularization, which can help prevent overfitting, while a larger value lets the model fit the training data more closely.  
    - Code example with the regularization parameter:
        ```
        pac = PassiveAggressiveClassifier(max_iter=50, C=1.0)  # C=1.0 is the default; smaller C means stronger regularization
        ```
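
The loss, update rule, and role of ```C``` described above can be sketched for a single training example. The code follows the PA-I variant from the Passive-Aggressive literature for labels in {-1, +1}. The function ```pa_update``` is a hypothetical helper written for this illustration, and whether it matches scikit-learn's internals step for step is an assumption.

```python
def pa_update(weights, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss
    if loss == 0.0:
        return list(weights)                # passive: margin satisfied, no change
    squared_norm = sum(xi * xi for xi in x)
    if squared_norm == 0.0:
        return list(weights)                # an all-zero example carries no direction
    tau = min(C, loss / squared_norm)       # aggressive step, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)          # loss is 1, so the weights move
w2 = pa_update(w1, [2.0, 0.0], +1)          # margin is 2, so nothing changes
w3 = pa_update(w1, [0.0, 1.0], -1)          # margin is 0, so the weights move again
w_small_c = pa_update(w0, [1.0, 0.0], +1, C=0.5)  # same example, smaller cap

print(w1, w2, w3, w_small_c)
```

With ```C=0.5``` the first update is half as large, which shows how a smaller ```C``` damps each correction and acts as stronger regularization.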
  
### Phases

---
1. **Initialization**:  
    ```max_iter``` is a hyperparameter that sets the upper limit on the number of passes over the training data (epochs) that the algorithm performs while fitting the model.
    - **Purpose**: It bounds the training time. If the algorithm has not converged within the specified number of passes, it stops and keeps the weights reached at that point.
    - **Impact**: Setting max_iter too low may result in an underfitted model if the algorithm does not have enough iterations to converge. Conversely, setting it too high may lead to unnecessary computation time without significant improvements in model performance.
2. **Model Training**:  
    The ```fit``` method is called to train the classifier using the training data (```tfidf_train```) and the corresponding labels (```y_train```).
    The model learns to associate the TF-IDF features with the labels provided.
3. **Making Predictions**:  
The ```predict``` method is used to make predictions on the test set (```tfidf_test```).
The predicted labels are stored in the variable ```y_pred```.
4. **Calculating Accuracy**:  
The ```accuracy_score``` function from ```Scikit-learn``` is used to calculate the accuracy of the model.
It compares the true labels (```y_test```) with the predicted labels (```y_pred```) and returns the proportion of correct predictions.

In [7]:
#DataFlair - Initialize a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)

#DataFlair - Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.42%


## Confusion matrix

---
While the accuracy score is a commonly used metric for evaluating the performance of a classification model, it does not provide a complete picture of the model's performance. Here are several reasons why a confusion matrix is essential, even when you have the accuracy score:

### 1. Detailed Performance Insights
- Class-wise Performance:  
The confusion matrix breaks down the performance of the model for each class. It shows how many instances were correctly classified (True Positives and True Negatives) and how many were misclassified (False Positives and False Negatives).
- Error Analysis:  
By examining the confusion matrix, you can identify specific classes where the model is performing poorly. For example, if a model has high accuracy but consistently misclassifies a particular class, this can be crucial information for model improvement.
### 2. Imbalanced Datasets
- Accuracy Limitations:  
In cases of imbalanced datasets (where one class significantly outnumbers another), accuracy can be misleading. For instance, if 95% of the data belongs to one class, a model that predicts all instances as that class can achieve 95% accuracy, but it would not be useful.
- Confusion Matrix Utility:  
The confusion matrix provides a more nuanced view of performance, allowing you to see how well the model is performing across all classes, regardless of their distribution.
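
The 95% scenario can be reproduced with a few lines of plain Python. The labels below are made up for the illustration and are not taken from news.csv.

```python
# Hypothetical imbalanced test set: 95 REAL articles, 5 FAKE ones
toy_true = ["REAL"] * 95 + ["FAKE"] * 5
# A useless "model" that always predicts the majority class
toy_pred = ["REAL"] * 100

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

# Recall for FAKE: share of actual FAKE articles that were flagged as FAKE
fake_found = sum(t == "FAKE" and p == "FAKE" for t, p in zip(toy_true, toy_pred))
fake_recall = fake_found / toy_true.count("FAKE")

print(accuracy, fake_recall)  # 0.95 0.0
```

The accuracy looks strong, yet not a single FAKE article is detected, which is exactly what the FAKE row of a confusion matrix would reveal.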
### 3. Computation of Other Metrics
The confusion matrix allows for the calculation of other important evaluation metrics, such as:
- Precision:  
The ratio of true positive predictions to the total predicted positives.
- Recall (Sensitivity):  
The ratio of true positive predictions to the total actual positives.
- F1 Score:  
The harmonic mean of precision and recall, providing a balance between the two.

These metrics can be critical for understanding the trade-offs between false positives and false negatives, especially in applications where one type of error is more costly than the other (e.g., medical diagnoses).
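
As a sketch, these metrics can be derived by hand from the confusion matrix this notebook reports for the test set. Treating FAKE as the positive class is an assumption made for the illustration. The row and column order follows ```labels=['FAKE','REAL']```, with rows as true labels and columns as predicted labels.

```python
# Confusion matrix from the notebook output: rows = true, columns = predicted
cm = [[585, 53],    # true FAKE: predicted FAKE, predicted REAL
      [43, 586]]    # true REAL: predicted FAKE, predicted REAL

tp, fn = cm[0]      # FAKE is treated as the positive class (assumption)
fp, tn = cm[1]

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way, 1171 correct predictions out of 1267, agrees with the 92.42% printed earlier.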
### 4. Visual Representation
- Easier Interpretation:  
The confusion matrix provides a visual representation of the model's performance, making it easier to interpret and communicate results to stakeholders.
- Heatmaps:  
Confusion matrices can be visualized as heatmaps, which can quickly highlight areas of concern in the model's predictions.

In [8]:
#DataFlair - Build confusion matrix
confusion_matrix(y_test, y_pred, labels=['FAKE','REAL'])

array([[585,  53],
       [ 43, 586]])