 Installing the Libraries


In [None]:
pip install pyspark xgboost


Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=67bff200352d12431485dbdb5d1346035b6533327414d6ebf07caefe31d49e8a
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


**This command installs two libraries:**

**pyspark:** This is the Python API for Apache Spark, a powerful open-source distributed computing framework used for big data processing and analytics.

**xgboost:** This stands for eXtreme Gradient Boosting, a scalable and accurate implementation of gradient boosting machines.


**Understanding PySpark**

**What is PySpark?**


PySpark allows you to interface with Apache Spark through Python. Apache Spark is a fast, distributed processing system that enables efficient processing of large datasets.

**Understanding XGBoost**

**What is XGBoost?**

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It provides parallel tree boosting and is widely used for regression, classification, and ranking problems.

In [None]:
pip install sparkxgb

Collecting sparkxgb
  Downloading sparkxgb-0.1.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... done
Collecting pyspark==3.1.1 (from sparkxgb)
  Downloading pyspark-3.1.1.tar.gz (212.3 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9 (from pyspark==3.1.1->sparkxgb)
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: sparkxgb, pyspark
  Building wheel for sparkxgb (setup.py) ... done
  Created wheel for sparkxgb: filename=sparkxgb-0.1-py3-none-any.whl size=5629 sha256=4d96ca8ebbc3ad4cfb6c8e50951de1fc582d6e6d9c2117d211b539860cf01518
  Stored in directory: /root/.cache/pip/wheels/b7/0c/a1/786408e13056fabeb8a72134e101b1e142fc95905c7b0e2

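Running pip install sparkxgb installs a library that integrates XGBoost with Spark, allowing you to leverage the distributed computing power of Spark for training XGBoost models. Note from the output above that sparkxgb 0.1 requires pyspark==3.1.1, so pip downloads that version in addition to the pyspark 3.5.1 installed earlier.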

The example demonstrates the steps to create a Spark session, define and train an XGBoost model, and make predictions on a dataset.

# Understanding sparkxgb

sparkxgb integrates XGBoost with Spark, enabling you to use the powerful machine learning capabilities of XGBoost within the distributed computing environment of Spark. This can be particularly useful when working with large datasets that do not fit into the memory of a single machine.

In [None]:
pip install emoji

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji
Successfully installed emoji-2.12.1


The emoji library is a handy tool for working with emojis in Python. By installing the library using pip install emoji, you gain access to functions that allow you to add emojis to strings, replace emoji aliases, convert emojis back to their aliases, and check for the presence of emojis in strings.

This can be particularly useful for creating more engaging and expressive text in applications such as chatbots, social media tools, and more.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The first line imports the 'drive' module from the google.colab package. The google.colab package provides utilities for using Google Colab, and the drive module within it specifically deals with Google Drive integration.

The second line mounts your Google Drive at the specified directory ('/content/drive') within the Colab environment.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
from pyspark.ml.feature import CountVectorizer, StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
import pandas as pd
import xgboost as xgb

# Initialize Spark session
spark = SparkSession.builder.appName("SarcasmDetection").getOrCreate()

# Load the dataset
df = spark.read.csv("/content/drive/MyDrive/iSarcasm/isarcasm.csv", header=True, inferSchema=True)

# Display the schema
df.printSchema()

# Display the data
df.show()

root
 |-- _c0: string (nullable = true)
 |-- tweet: string (nullable = true)
 |-- sarcastic: string (nullable = true)
 |-- rephrase: string (nullable = true)
 |-- sarcasm: string (nullable = true)
 |-- irony: string (nullable = true)
 |-- satire: string (nullable = true)
 |-- understatement: string (nullable = true)
 |-- overstatement: string (nullable = true)
 |-- rhetorical_question: string (nullable = true)

+---+--------------------+--------------------+--------------------+--------------------+-----+------+--------------+-------------+-------------------+
|_c0|               tweet|           sarcastic|            rephrase|             sarcasm|irony|satire|understatement|overstatement|rhetorical_question|
+---+--------------------+--------------------+--------------------+--------------------+-----+------+--------------+-------------+-------------------+
|  0|The only thing I ...|                   1|College is really...|                   0|    1|     0|             0|            

**Spark Libraries:**

**SparkSession:** Entry point to Spark functionality.

**col, count, when:** Functions for building DataFrame column expressions.

**CountVectorizer, StringIndexer, VectorAssembler:** Machine learning feature transformers.

**Pipeline:** Chains feature transformers and estimators into a single ML workflow.

**Other Libraries:**

**pandas:** Data manipulation library.

**xgboost:** Machine learning library for gradient boosting.

* **SparkSession.builder.appName("SarcasmDetection").getOrCreate():** Initializes a Spark session named "SarcasmDetection", or reuses one that already exists. A session is required to interact with Spark's APIs.

* **spark.read.csv(...):** Reads the **CSV file** (isarcasm.csv) from Google Drive into a Spark DataFrame.

* **header=True:** Indicates that the first row of the CSV file contains column names.

* **inferSchema=True:** Asks Spark to infer the data type of each column from the data. In the schema printed above every column is still reported as string, which suggests that at least some rows hold non-numeric text in the label columns.

**df.printSchema():** Prints the schema of the DataFrame to show the structure and data types of each column.

**df.show():** Displays the first 20 rows of the DataFrame, truncating long values to 20 characters. This is useful for getting a quick look at the data.
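One possible reason for text landing in the label columns is a tweet that contains a line break inside a quoted field. This is an assumption, since the CSV file itself is not shown. The sketch below uses Python's standard csv module on a small made-up sample to show how a line-based reader splits such a record while a quote-aware reader keeps it whole. Spark's CSV reader has a multiLine option that may be worth trying in that situation.

```python
import csv
import io

# Hypothetical sample: the first tweet contains a line break inside quotes.
raw = 'tweet,sarcastic\n"first line\nsecond line",1\nplain tweet,0\n'

# Naive line-based reading breaks the quoted tweet into two fragments.
naive_rows = [line.split(",") for line in raw.splitlines()[1:]]

# A quote-aware parser keeps the whole tweet in a single field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows), "fragments from line-based reading")
print(len(parsed_rows), "records from quote-aware parsing")
```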

In [None]:
df.count()

3834

**df.count():** It provides the total number of rows in the DataFrame, which is useful for understanding the size of the dataset.

In [None]:
df.columns


['_c0',
 'tweet',
 'sarcastic',
 'rephrase',
 'sarcasm',
 'irony',
 'satire',
 'understatement',
 'overstatement',
 'rhetorical_question']

**df.columns:** It provides a list of all column names in the DataFrame, which helps in understanding the structure of your dataset.

In [None]:
null_counts = df.agg(*[count(when(col(i).isNull(), i)).alias(i) for i in df.columns])

null_counts.show()

+---+-----+---------+--------+-------+-----+------+--------------+-------------+-------------------+
|_c0|tweet|sarcastic|rephrase|sarcasm|irony|satire|understatement|overstatement|rhetorical_question|
+---+-----+---------+--------+-------+-----+------+--------------+-------------+-------------------+
|  0|  116|      495|    2932|   2959| 2964|  2966|          2971|         2971|               3012|
+---+-----+---------+--------+-------+-----+------+--------------+-------------+-------------------+



**df.columns:** Retrieves a list of all column names in the DataFrame.

**col(i).isNull():** Evaluates to true for every row in which column i is null.

**when(col(i).isNull(), i):** Returns a non-null value (the column name, as a literal) for rows where column i is null, and null for all other rows.

**count(when(col(i).isNull(), i)):** Counts the non-null results of the when expression, which equals the number of null values in column i.

**alias(i):** Names the resulting count after the column i.

**[count(when(col(i).isNull(), i)).alias(i) for i in df.columns]:** Creates one such counting expression per column.

**df.agg(*...):** Evaluates all of these expressions over the DataFrame in one aggregation, producing a single row with the null count of each column.
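The same counting logic can be sketched in plain Python, which may make the when/count combination easier to follow. The rows below are made-up sample data and count_nulls is a hypothetical helper, not part of PySpark.

```python
# Made-up rows standing in for DataFrame records; None plays the role of null.
rows = [
    {"tweet": "hello", "sarcastic": "1", "irony": None},
    {"tweet": None, "sarcastic": "0", "irony": None},
    {"tweet": "bye", "sarcastic": None, "irony": "1"},
]

def count_nulls(rows, columns):
    """Return, for each column, how many rows hold None in that column."""
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}

null_counts = count_nulls(rows, ["tweet", "sarcastic", "irony"])
print(null_counts)
```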

In [None]:
df = df.drop('_c0','rephrase','overstatement','understatement','rhetorical_question')

**Purpose:** To remove unwanted columns from the DataFrame df.

**Effect:** The DataFrame df will no longer contain the columns '_c0', 'rephrase', 'overstatement', 'understatement', and 'rhetorical_question'.

In [None]:
df=df.na.drop(how="all")

**Purpose:** To remove rows from the DataFrame where all values are null.

**Effect:** Cleans the DataFrame by eliminating rows that do not contain any meaningful data (i.e., rows with all null values).

In [None]:
df=df.na.drop(subset='tweet')

**Purpose:** To remove rows from the DataFrame where the value in the 'tweet' column is null.

**Effect:** Ensures that the DataFrame only contains rows where the 'tweet' column has valid (non-null) values.

In [None]:
df=df.na.drop(subset='sarcastic')

**Purpose:** To remove rows from the DataFrame where the value in the 'sarcastic' column is null.

**Effect:** Ensures that the DataFrame only contains rows where the 'sarcastic' column has valid (non-null) values.

In [None]:
df=df.na.fill('0')

**df:** This is your DataFrame that may contain null values.

**na:** This accesses the na (null value handling) functions of the DataFrame.

**fill('0'):** Replaces null values with the string '0'. A string fill value is applied only to string columns; since every remaining column here is a string column, all remaining nulls are filled.

In [None]:
null_counts = df.agg(*[count(when(col(i).isNull(), i)).alias(i) for i in df.columns])

null_counts.show()

+-----+---------+-------+-----+------+
|tweet|sarcastic|sarcasm|irony|satire|
+-----+---------+-------+-----+------+
|    0|        0|      0|    0|     0|
+-----+---------+-------+-----+------+



**df.columns:** Retrieves a list of all column names in the DataFrame.

**[count(when(col(i).isNull(), i)).alias(i) for i in df.columns]:** This is a list comprehension that creates a list of aggregation expressions. For each column i:

* col(i).isNull(): Checks if the column value is null.

* when(col(i).isNull(), i): Returns the column name if the value is null.

* count(when(col(i).isNull(), i)): Counts the number of null values in the column.

* .alias(i): Assigns the count result an alias, which is the column name.

**df.agg(*[...]):** Aggregates the DataFrame based on the list of aggregation expressions.

**null_counts:** The resulting DataFrame containing the count of null values for each column.

**null_counts.show():** Displays the DataFrame with null counts for each column.

In [None]:
df.count()

3338

In [None]:
from pyspark.sql.functions import regexp_extract, col

# Regex pattern for matching emojis
emoji_pattern = "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U000024C2-\U0001F251]+"

# Extract emojis from the text column
df = df.withColumn("emojis", regexp_extract(col("tweet"), emoji_pattern, 0))

df.show()


+--------------------+--------------------+--------------------+-----+------+------+
|               tweet|           sarcastic|             sarcasm|irony|satire|emojis|
+--------------------+--------------------+--------------------+-----+------+------+
|The only thing I ...|                   1|                   0|    1|     0|      |
|I love it when pr...|                   1|                   1|    0|     0|    ツ|
|Remember the hund...|                   1|                   0|    1|     0|🥰🙌🏼|
|Today my pop-pop ...|                   1|                   1|    0|     0|    🙃|
|@VolphanCarol @li...|                   1|                   1|    0|     0|      |
|"@jimrossignol I ...| poor folks in Ub...|It's a terrible n...|    0|     1|      |
|Why would Alexa's...|                   1|                   0|    1|     0|      |
|someone hit me w ...|                   1|                   1|    0|     0|      |
|Loving season 4 o...|                   1|                   1|    0|

**df.withColumn("emojis", regexp_extract(col("tweet"), emoji_pattern, 0)):**

Adds a new column named "emojis" to the DataFrame df.

Uses regexp_extract to search the "tweet" column (col("tweet")) for the first run of characters that matches emoji_pattern.

The 0 parameter selects group 0, that is, the entire matched substring. Only the first match is returned, so emojis that appear later in the tweet, separated by other text, are not captured. If nothing matches, the result is an empty string.

Results are stored in the new "emojis" column.
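The output above lists "ツ" (a Japanese katakana character) in the emojis column. A likely explanation is the last range in the pattern, U+24C2 to U+1F251, which spans far more than emoji. The sketch below reproduces this with Python's re module rather than Spark; first_match is a hypothetical helper that mimics the first-match behaviour, and the narrower pattern is only one illustrative alternative, not a complete emoji definition.

```python
import re

# Same ranges as the notebook's pattern, including the very wide last range.
broad = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF"
    "\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF"
    "\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF"
    "\U00002702-\U000027B0\U000024C2-\U0001F251]+"
)

# Hypothetical narrower variant that omits the U+24C2..U+1F251 range.
narrow = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF"
    "\U0001F900-\U0001F9FF\U0001FA70-\U0001FAFF\U00002702-\U000027B0]+"
)

def first_match(pattern, text):
    """Return the first matching run, or an empty string if none is found."""
    match = pattern.search(text)
    return match.group(0) if match else ""

katakana_tweet = "I love it when this happens \u30c4"
print(repr(first_match(broad, katakana_tweet)))   # katakana is matched
print(repr(first_match(narrow, katakana_tweet)))  # nothing is matched
```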

In [None]:
import emoji
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a UDF that uses the emoji library to convert emojis to text
def emoji_to_text(text):
    return emoji.demojize(text)

emoji_to_text_udf = udf(emoji_to_text, StringType())

# Apply the UDF to your DataFrame column
df = df.withColumn("text", emoji_to_text_udf("tweet"))
df.show()

+--------------------+--------------------+--------------------+-----+------+------+--------------------+
|               tweet|           sarcastic|             sarcasm|irony|satire|emojis|                text|
+--------------------+--------------------+--------------------+-----+------+------+--------------------+
|The only thing I ...|                   1|                   0|    1|     0|      |The only thing I ...|
|I love it when pr...|                   1|                   1|    0|     0|    ツ|I love it when pr...|
|Remember the hund...|                   1|                   0|    1|     0|🥰🙌🏼|Remember the hund...|
|Today my pop-pop ...|                   1|                   1|    0|     0|    🙃|Today my pop-pop ...|
|@VolphanCarol @li...|                   1|                   1|    0|     0|      |@VolphanCarol @li...|
|"@jimrossignol I ...| poor folks in Ub...|It's a terrible n...|    0|     1|      |"@jimrossignol I ...|
|Why would Alexa's...|                   1|        

**udf:** Stands for User Defined Function. It allows you to define custom functions that can be applied to DataFrame columns.

**StringType:** Represents string data type in PySpark.

**emoji_to_text:** Python function that takes text as input and uses emoji.demojize to convert emojis in text to their textual representation.

**udf(emoji_to_text, StringType()):** Creates a PySpark UDF (UserDefinedFunction) named emoji_to_text_udf. It specifies that the output type is StringType.

**df.withColumn("text", emoji_to_text_udf("tweet")):** Creates a new column "text" in the DataFrame df by applying emoji_to_text_udf to the "tweet" column, so that each emoji is replaced by its textual alias.

In [None]:
from pyspark.sql.functions import regexp_replace, col

# Regex pattern for matching all special characters
special_char_pattern = r"[^a-zA-Z0-9\s]"

# Remove special characters from the text column
df = df.withColumn("clean_text", regexp_replace(col("text"), special_char_pattern, ""))

df.show()


+--------------------+--------------------+--------------------+-----+------+------+--------------------+--------------------+
|               tweet|           sarcastic|             sarcasm|irony|satire|emojis|                text|          clean_text|
+--------------------+--------------------+--------------------+-----+------+------+--------------------+--------------------+
|The only thing I ...|                   1|                   0|    1|     0|      |The only thing I ...|The only thing I ...|
|I love it when pr...|                   1|                   1|    0|     0|    ツ|I love it when pr...|I love it when pr...|
|Remember the hund...|                   1|                   0|    1|     0|🥰🙌🏼|Remember the hund...|Remember the hund...|
|Today my pop-pop ...|                   1|                   1|    0|     0|    🙃|Today my pop-pop ...|Today my poppop t...|
|@VolphanCarol @li...|                   1|                   1|    0|     0|      |@VolphanCarol @li...|VolphanCaro

**regexp_replace:** PySpark SQL function used to replace substrings that match a regex pattern with a specified string.

**special_char_pattern:** Regular expression pattern that matches all characters except letters ('a-zA-Z'), digits ('0-9'), and whitespace ('\s').

**df.withColumn("clean_text", regexp_replace(col("text"), special_char_pattern, "")):**

Creates a new column "clean_text" in the DataFrame df.

Applies regexp_replace to the "text" column (col("text")), replacing all substrings that match special_char_pattern with an empty string ("").

Stores the result in the new column "clean_text".
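To see what this pattern does to mentions, hyphens, and emoji aliases, here is a small sketch using Python's re module instead of Spark. The sample sentence is made up, and the alias ":upside-down_face:" is assumed to be what the emoji step produced for the upside-down face emoji.

```python
import re

# Same pattern as in the notebook, written as a raw string.
special_char_pattern = r"[^a-zA-Z0-9\s]"

def clean(text):
    """Remove every character that is not a letter, digit, or whitespace."""
    return re.sub(special_char_pattern, "", text)

# Made-up sample with a mention, a hyphenated word, and an emoji alias.
sample = "@user Today my pop-pop said hi :upside-down_face:"
print(clean(sample))
```

Because the colons, hyphen, and underscore are all removed, the alias collapses into the single token "upsidedownface", and "pop-pop" becomes "poppop", as in the clean_text column above.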

In [None]:
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover

# Tokenize the text column into words
tokenizer = Tokenizer(inputCol="clean_text", outputCol="words")
df = tokenizer.transform(df)

# Remove stopwords
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
df = remover.transform(df)

# Show the resulting DataFrame with stopwords removed
df.show()


+--------------------+--------------------+--------------------+-----+------+------+--------------------+--------------------+--------------------+--------------------+
|               tweet|           sarcastic|             sarcasm|irony|satire|emojis|                text|          clean_text|               words|            filtered|
+--------------------+--------------------+--------------------+-----+------+------+--------------------+--------------------+--------------------+--------------------+
|The only thing I ...|                   1|                   0|    1|     0|      |The only thing I ...|The only thing I ...|[the, only, thing...|[thing, got, coll...|
|I love it when pr...|                   1|                   1|    0|     0|    ツ|I love it when pr...|I love it when pr...|[i, love, it, whe...|[love, professors...|
|Remember the hund...|                   1|                   0|    1|     0|🥰🙌🏼|Remember the hund...|Remember the hund...|[remember, the, h...|[remember, h

**Tokenizer:** A feature transformer in PySpark ML that converts text to lowercase and splits it into words (tokens) on whitespace. For splitting on a custom pattern, PySpark provides RegexTokenizer.

**StopWordsRemover:** A feature transformer in PySpark ML that removes common words (stopwords) from a sequence of words. By default it uses a built-in English stopword list.

**Tokenizer(inputCol="clean_text", outputCol="words"):** Creates a Tokenizer instance that will tokenize the "clean_text" column (inputCol="clean_text") and output the tokens into a new column named "words" (outputCol="words").

**tokenizer.transform(df):** Transforms the DataFrame df by applying the Tokenizer to create a new column "words" containing tokenized text.

**StopWordsRemover(inputCol="words", outputCol="filtered"):** Creates a StopWordsRemover instance that removes stopwords from the "words" column (inputCol="words") and outputs the remaining words into a new column named "filtered" (outputCol="filtered").

**remover.transform(df):** Transforms the DataFrame df by applying the StopWordsRemover to create a new column "filtered" containing words with stopwords removed.
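The two steps can be approximated in plain Python as follows. This is a simplified sketch, not Spark's implementation: tokenize and remove_stop_words are hypothetical helpers, and STOP_WORDS is a tiny illustrative list rather than Spark's built-in English list.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

# Tiny illustrative stopword list (an assumption, far shorter than Spark's).
STOP_WORDS = {"the", "only", "i", "it", "when"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stop_words]

words = tokenize("I love it when professors reply")
filtered = remove_stop_words(words)
print(words)
print(filtered)
```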

In [None]:
stringIndexer1 = StringIndexer(inputCol="sarcastic", outputCol="label")
stringIndexer2 = StringIndexer(inputCol="sarcasm", outputCol="newsarcasm")
stringIndexer3 = StringIndexer(inputCol="irony", outputCol="newirony")
stringIndexer4 = StringIndexer(inputCol="satire", outputCol="newsatire")
countVectorizer = CountVectorizer(inputCol="filtered", outputCol="features")

vectorAssembler = VectorAssembler(inputCols=["features","newsarcasm","newirony","newsatire"], outputCol="combinedFeatures")

# Define the pipeline
pipeline = Pipeline(stages=[stringIndexer1, stringIndexer2, stringIndexer3, stringIndexer4, countVectorizer, vectorAssembler])

# Fit the pipeline to the DataFrame
model = pipeline.fit(df)

# Transform the DataFrame
df = model.transform(df)

df.show()

+--------------------+--------------------+--------------------+-----+------+------+--------------------+--------------------+--------------------+--------------------+-----+----------+--------+---------+--------------------+--------------------+
|               tweet|           sarcastic|             sarcasm|irony|satire|emojis|                text|          clean_text|               words|            filtered|label|newsarcasm|newirony|newsatire|            features|    combinedFeatures|
+--------------------+--------------------+--------------------+-----+------+------+--------------------+--------------------+--------------------+--------------------+-----+----------+--------+---------+--------------------+--------------------+
|The only thing I ...|                   1|                   0|    1|     0|      |The only thing I ...|The only thing I ...|[the, only, thing...|[thing, got, coll...|  1.0|       0.0|     1.0|      0.0|(10223,[19,42,419...|(10226,[19,42,419...|
|I love it w

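**StringIndexer:** A feature transformer in PySpark ML that maps each distinct string value in a column to a numeric index. By default the most frequent value receives index 0.0, the next most frequent 1.0, and so on.

**CountVectorizer:** A feature transformer in PySpark ML that learns a vocabulary from the token lists and converts each list into a sparse vector of token counts.

**VectorAssembler:** A transformer in PySpark ML that combines multiple columns into a single vector column.

**Pipeline:** A sequence of stages to process and learn from data in a structured way.

**CountVectorizer(inputCol="filtered", outputCol="features"):** Creates a CountVectorizer instance to convert the "filtered" column (containing words after stopwords removal) into a feature vector. It outputs a sparse vector of word counts into a new column named "features".

**VectorAssembler(inputCols=[...], outputCol="combinedFeatures"):** Appends the three indexed columns to the word-count vector, which is why combinedFeatures has 10,226 entries in the output above, three more than the 10,223 in features.

**pipeline.fit(df):** Fits the defined pipeline to the DataFrame df, applying each stage sequentially.

**model.transform(df):** Transforms the DataFrame df using the fitted pipeline (model), applying all stages (StringIndexer, CountVectorizer, VectorAssembler).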
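StringIndexer's default ordering is by descending frequency, which matters here because a few rows carry stray text in the label columns. The plain-Python sketch below imitates that ordering on made-up values; frequency_index is a hypothetical helper and the tie-breaking rule is an assumption.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumed rule).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up column: mostly "0" and "1", plus one stray text fragment.
sample = ["0", "0", "0", "1", "1", " poor folks in Ub"]
mapping = frequency_index(sample)
print(mapping)

# Keeping only indices 0.0 and 1.0 drops the stray value.
kept = [v for v in sample if mapping[v] <= 1]
print(kept)
```

Under this ordering, rare malformed values receive indices of 2.0 or higher, so a later filter on values up to 1 removes them.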

In [None]:
df.printSchema()

root
 |-- tweet: string (nullable = false)
 |-- sarcastic: string (nullable = false)
 |-- sarcasm: string (nullable = false)
 |-- irony: string (nullable = false)
 |-- satire: string (nullable = false)
 |-- emojis: string (nullable = false)
 |-- text: string (nullable = true)
 |-- clean_text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filtered: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- label: double (nullable = false)
 |-- newsarcasm: double (nullable = false)
 |-- newirony: double (nullable = false)
 |-- newsatire: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- combinedFeatures: vector (nullable = true)



**Purpose:** To understand the structure of the DataFrame df and its columns.

**Output:** Describes each column with its name, data type, and nullability.

In [None]:
df=df.filter(df.label <= 1)
df=df.filter(df.newsarcasm <= 1)
df=df.filter(df.newirony <= 1)
df=df.filter(df.newsatire <= 1)

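**Purpose:** To keep only rows whose indexed values in the "label", "newsarcasm", "newirony", and "newsatire" columns are 0.0 or 1.0. Because StringIndexer assigns 0.0 and 1.0 to the two most frequent values, any other value, such as tweet text that ended up in a label column, receives an index of 2.0 or higher and is removed here.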

In [None]:
df.count()

3230

In [None]:
df.show()

+--------------------+---------+-------+-----+------+------+--------------------+--------------------+--------------------+--------------------+-----+----------+--------+---------+--------------------+--------------------+
|               tweet|sarcastic|sarcasm|irony|satire|emojis|                text|          clean_text|               words|            filtered|label|newsarcasm|newirony|newsatire|            features|    combinedFeatures|
+--------------------+---------+-------+-----+------+------+--------------------+--------------------+--------------------+--------------------+-----+----------+--------+---------+--------------------+--------------------+
|The only thing I ...|        1|      0|    1|     0|      |The only thing I ...|The only thing I ...|[the, only, thing...|[thing, got, coll...|  1.0|       0.0|     1.0|      0.0|(10223,[19,42,419...|(10226,[19,42,419...|
|I love it when pr...|        1|      1|    0|     0|    ツ|I love it when pr...|I love it when pr...|[i, lov

In [None]:
df.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 2435|
|  1.0|  795|
+-----+-----+




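The code df.groupBy("label").count().show() groups the DataFrame df by the label column, counts the rows in each group, and displays the result. Here about three quarters of the rows (2,435 of 3,230) have label 0.0, so the two classes are imbalanced.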

In [None]:
df.groupBy("newsarcasm").count().show()

+----------+-----+
|newsarcasm|count|
+----------+-----+
|       0.0| 2578|
|       1.0|  652|
+----------+-----+




The code df.groupBy("newsarcasm").count().show() groups the DataFrame df by the newsarcasm column, counts the rows in each group, and displays the result.

In [None]:
df.groupBy("newirony").count().show()

+--------+-----+
|newirony|count|
+--------+-----+
|     0.0| 3090|
|     1.0|  140|
+--------+-----+




The code df.groupBy("newirony").count().show() groups the DataFrame df by the newirony column, counts the rows in each group, and displays the result.

In [None]:
df.groupBy("newsatire").count().show()

+---------+-----+
|newsatire|count|
+---------+-----+
|      0.0| 3208|
|      1.0|   22|
+---------+-----+




The code df.groupBy("newsatire").count().show() groups the DataFrame df by the newsatire column, counts the rows in each group, and displays the result.

In [None]:
finalized_data = df.select("combinedFeatures", "label")


The code finalized_data = df.select("combinedFeatures", "label") creates a new DataFrame finalized_data by selecting only the combinedFeatures and label columns from the original DataFrame df.

In [None]:
finalized_data.show()

+--------------------+-----+
|    combinedFeatures|label|
+--------------------+-----+
|(10226,[19,42,419...|  1.0|
|(10226,[0,2,3,8,5...|  1.0|
|(10226,[3,10,43,5...|  1.0|
|(10226,[14,20,116...|  1.0|
|(10226,[52,106,10...|  1.0|
|(10226,[1,1789,34...|  1.0|
|(10226,[0,3,10,17...|  1.0|
|(10226,[140,151,2...|  1.0|
|(10226,[1,93,124,...|  1.0|
|(10226,[25,41,74,...|  1.0|
|(10226,[16,23,68,...|  1.0|
|(10226,[2,14,44,5...|  1.0|
|(10226,[6,1850,18...|  1.0|
|(10226,[6,18,27,3...|  1.0|
|(10226,[2,5,7,28,...|  1.0|
|(10226,[16,63,233...|  1.0|
|(10226,[2,11,20,5...|  1.0|
|(10226,[82,686,25...|  1.0|
|(10226,[25,37,321...|  1.0|
|(10226,[16,60,140...|  1.0|
+--------------------+-----+
only showing top 20 rows



In [None]:
finalized_data.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 2435|
|  1.0|  795|
+-----+-----+



The code finalized_data.groupBy("label").count().show() groups the finalized_data DataFrame by the label column, counts the rows in each group, and displays the result.

In [None]:
# Split the data into train and test sets
train_data, test_data = finalized_data.randomSplit([0.8, 0.2])


The code train_data, test_data = finalized_data.randomSplit([0.8, 0.2]) splits the finalized_data DataFrame into two subsets, train_data and test_data, with approximately 80% of the rows assigned to train_data and 20% to test_data. Rows are assigned at random, so the proportions are approximate, and because no seed is passed the split differs from run to run.
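The idea behind a weighted random split can be sketched in plain Python: each row is independently sent to one side with the given probability. This is a simplified illustration, not Spark's implementation; random_split is a hypothetical helper. PySpark's randomSplit also accepts a seed argument, which makes the split repeatable in the same way as the seed used below.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(3230))  # stand-in for the 3,230 rows of finalized_data
train, test = random_split(rows)
print(len(train), len(test))  # close to, but rarely exactly, 80% / 20%
```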

In [None]:
train_data.groupBy().count().show()

+-----+
|count|
+-----+
| 2580|
+-----+



The code train_data.groupBy().count().show() groups the train_data DataFrame without any grouping column, so it counts the total number of rows in the DataFrame and displays that count. The shorter train_data.count() returns the same number.

In [None]:
test_data.groupBy().count().show()

+-----+
|count|
+-----+
|  650|
+-----+



In [None]:
import numpy as np

In [None]:
# Convert to pandas DataFrame
train_pd = train_data.select("combinedFeatures", "label").toPandas()
test_pd = test_data.select("combinedFeatures", "label").toPandas()

**train_data.select("combinedFeatures", "label"):** Selects only the columns "combinedFeatures" and "label" from the train_data DataFrame.

**.toPandas():** Converts the PySpark DataFrame train_data (or test_data in the second line) into a pandas DataFrame (train_pd or test_pd respectively). This collects all rows to the driver, so it is only suitable when the data fits in the memory of a single machine, as it does for the roughly 3,000 rows here. Each cell of 'combinedFeatures' holds a Spark vector object, so pandas stores the column with dtype object.

In [None]:
train_data.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0| 1928|
|  1.0|  652|
+-----+-----+



In [None]:
test_data.groupBy("label").count().show()

+-----+-----+
|label|count|
+-----+-----+
|  0.0|  507|
|  1.0|  143|
+-----+-----+



In [None]:
train_pd.columns

Index(['combinedFeatures', 'label'], dtype='object')

In [None]:
print(train_pd.dtypes)

combinedFeatures     object
label               float64
dtype: object


In [None]:
# Convert 'combinedFeatures' to 'category' if it's categorical
train_pd['combinedFeatures'] = train_pd['combinedFeatures'].astype('category')
test_pd['combinedFeatures'] = test_pd['combinedFeatures'].astype('category')
# Check the dtypes again
print(train_pd.dtypes)


combinedFeatures    category
label                float64
dtype: object


These lines convert the 'combinedFeatures' column in both the train_pd and test_pd pandas DataFrames to the 'category' data type. Note that 'combinedFeatures' holds one Spark feature vector per row rather than a categorical value, so after this conversion each distinct vector is treated as its own category. The category type is intended for columns with a small set of repeated values (e.g., predefined categories or labels).

The final line prints the data types of all columns in the train_pd DataFrame. It helps to verify that 'combinedFeatures' is indeed of type 'category' after the conversion.
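If the goal is to give a model one numeric column per vocabulary entry, the sparse vectors need to be expanded rather than cast to a category. The sketch below shows the expansion in plain Python for a vector written as (size, indices, values), the same layout that appears in the combinedFeatures output. The helper to_dense and the sample numbers are hypothetical; with PySpark vector objects, the toArray() method performs the equivalent conversion.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Made-up small vector: 8 slots, non-zero counts at positions 1, 4, and 6.
row = to_dense(8, [1, 4, 6], [2.0, 1.0, 1.0])
print(row)
```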

In [None]:
# Use the Pandas equivalent for grouping and counting
train_pd.groupby("label").size()

label
0.0    1928
1.0     652
dtype: int64

In [None]:
xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42, enable_categorical=True)
xgb_model.fit(train_pd.drop('label', axis=1), train_pd['label'])



**xgb.XGBClassifier:** Initializes an instance of the XGBoost classifier.

**objective='binary:logistic':** Specifies the objective function for binary classification using logistic regression.

**random_state=42:** Sets a seed for reproducibility.

**enable_categorical=True:** Enables usage of pandas 'category' columns as features in XGBoost. Here the only feature column is 'combinedFeatures', so in effect the model sees each row's vector as a single categorical value rather than as 10,226 numeric features.

**fit(train_pd.drop('label', axis=1), train_pd['label']):** Fits (trains) the XGBoost model (xgb_model) using training data.

**train_pd.drop('label', axis=1):** Drops the 'label' column from train_pd, leaving only the feature columns.

**train_pd['label']:** Provides the target variable (labels) for training, which is expected to be binary (0 or 1).

In [None]:
# Prepare DMatrix
dtrain = xgb.DMatrix(train_pd['combinedFeatures'].tolist(), label=train_pd['label'])
dtest = xgb.DMatrix(test_pd['combinedFeatures'].tolist(), label=test_pd['label'])

# Set parameters for XGBoost
params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'eta': 0.1,
    'eval_metric': 'logloss'
}

# Train the model
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, 'test')], early_stopping_rounds=10)

[0]	test-logloss:0.45143
[1]	test-logloss:0.38924
[2]	test-logloss:0.33937
[3]	test-logloss:0.29820
[4]	test-logloss:0.26347
[5]	test-logloss:0.23372
[6]	test-logloss:0.20815
[7]	test-logloss:0.18589
[8]	test-logloss:0.16644
[9]	test-logloss:0.14926
[10]	test-logloss:0.13416
[11]	test-logloss:0.12079
[12]	test-logloss:0.10891
[13]	test-logloss:0.09833
[14]	test-logloss:0.08890
[15]	test-logloss:0.08047
[16]	test-logloss:0.07279
[17]	test-logloss:0.06590
[18]	test-logloss:0.05973
[19]	test-logloss:0.05416
[20]	test-logloss:0.04916
[21]	test-logloss:0.04467
[22]	test-logloss:0.04064
[23]	test-logloss:0.03696
[24]	test-logloss:0.03367
[25]	test-logloss:0.03068
[26]	test-logloss:0.02797
[27]	test-logloss:0.02556
[28]	test-logloss:0.02339
[29]	test-logloss:0.02142
[30]	test-logloss:0.01965
[31]	test-logloss:0.01803
[32]	test-logloss:0.01657
[33]	test-logloss:0.01526
[34]	test-logloss:0.01408
[35]	test-logloss:0.01301
[36]	test-logloss:0.01204
[37]	test-logloss:0.01116
[38]	test-logloss:0.01

**xgb.DMatrix(...):** Converts data into an internal data structure (DMatrix) used by XGBoost.

**train_pd['combinedFeatures'].tolist():** Converts the 'combinedFeatures' column from the pandas DataFrame train_pd to a list format.

**label=train_pd['label']:** Specifies the labels for training (train_pd['label']).

Similar preparation is done for the test data (test_pd).

**params:** Dictionary containing parameters for configuring the XGBoost model.

**objective='binary:logistic':** Specifies binary classification with the logistic loss; predictions are probabilities.

**max_depth:** Maximum depth of a tree.

**eta:** Learning rate, i.e. the shrinkage factor applied to each new tree's contribution.

**eval_metric='logloss':** Metric used for evaluation during training.

**xgb.train(...):** Trains an XGBoost model.

**params:** Parameters for the model as defined earlier.

**dtrain:** Training data (DMatrix object).

**num_boost_round=100:** Maximum number of boosting rounds (iterations).

**evals=[(dtest, 'test')]:** Specifies evaluation data (dtest) and a name ('test') for it.

**early_stopping_rounds=10:** Stops training if the evaluation metric on the evals data has not improved for 10 consecutive rounds.
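The two ideas that drive the training log, the 'logloss' metric and early stopping, can be sketched in plain Python. This is a minimal illustration under stated assumptions: `log_loss` and `best_round_with_patience` are hypothetical helper names, the score history is made up, and the stopping rule only mirrors the general idea rather than XGBoost's internal implementation.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probs are predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def best_round_with_patience(scores, patience):
    """Return (stop_round, best_round) for a lower-is-better metric."""
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return round_idx, best_round  # no improvement for `patience` rounds
    return len(scores) - 1, best_round  # ran out of rounds before stopping

print(round(log_loss([1, 0], [0.9, 0.2]), 4))  # 0.1643

# Hypothetical per-round validation scores: improvement stalls after round 2
history = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36]
print(best_round_with_patience(history, patience=3))  # (5, 2)
```

In the made-up history the best score occurs at round 2, and training would stop at round 5 once three rounds have passed without a new best.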

In [None]:
# Predict
preds = bst.predict(dtest)
predictions = [1 if x > 0.5 else 0 for x in preds]

# Convert predictions and true labels to a Spark DataFrame
pred_df = pd.DataFrame({'prediction': predictions, 'label': test_pd['label']})
pred_spark_df = spark.createDataFrame(pred_df.itertuples(index=False), schema=['prediction', 'label'])
print(pred_spark_df)
print(pred_df)


# Evaluate the accuracy
correct_predictions = pred_spark_df.filter(pred_spark_df.prediction == pred_spark_df.label).count()
total_data = pred_spark_df.count()
accuracy = correct_predictions / total_data * 100

print(f"Accuracy: {accuracy}")

DataFrame[prediction: bigint, label: double]
     prediction  label
0             0    0.0
1             0    0.0
2             0    0.0
3             1    1.0
4             0    0.0
..          ...    ...
645           0    0.0
646           0    0.0
647           0    0.0
648           0    0.0
649           0    0.0

[650 rows x 2 columns]
Accuracy: 100.0


**bst.predict(dtest):** Predicts probabilities using the trained XGBoost model bst on the test data (dtest).

The list comprehension turns each probability into a class label: 1 if the probability is above 0.5, otherwise 0.

**spark.createDataFrame(...):** Converts the pandas DataFrame pred_df to a Spark DataFrame pred_spark_df.

**itertuples(index=False):** Iterates through the rows of pred_df without including the index.

The accuracy is evaluated by counting the correct predictions (rows where 'prediction' matches 'label'), dividing by the total number of predictions (total_data), and multiplying by 100 to express it as a percentage.
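The same thresholding and accuracy logic can be checked on a handful of values without Spark. The sketch below is a plain-Python equivalent under stated assumptions: `to_labels` and `accuracy_percent` are hypothetical helper names, and the sample probabilities and labels are invented for illustration rather than taken from the notebook's data.

```python
def to_labels(probs, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probs]

def accuracy_percent(predicted, actual):
    """Share of matching predictions, expressed as a percentage."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual) * 100

# Hypothetical probabilities and true labels (not the notebook's data)
sample_probs = [0.10, 0.80, 0.45, 0.95]
sample_labels = [0.0, 1.0, 1.0, 1.0]
sample_preds = to_labels(sample_probs)
print(sample_preds, accuracy_percent(sample_preds, sample_labels))  # [0, 1, 0, 1] 75.0
```

Note that integer predictions compare equal to float labels (1 == 1.0), which is also why the Spark filter works even though the two columns have different types.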

In [None]:
# Assuming 'prediction' is the column you want to group by
pred_spark_df.groupBy("prediction").count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         0|  507|
|         1|  143|
+----------+-----+



pred_spark_df.groupBy("prediction").count().show() groups pred_spark_df by the "prediction" column, counts the rows in each group, and displays the result. It shows the distribution of predicted classes: here 507 test rows were predicted as 0 and 143 as 1, which adds up to the 650 rows in the test set.
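For small results that already live in Python, the same group-and-count step can be sketched with the standard library. The list `sample_predictions` is a hypothetical stand-in for the notebook's `predictions` list.

```python
from collections import Counter

# Hypothetical predicted labels; in the notebook these would come from `predictions`
sample_predictions = [0, 0, 1, 0, 1, 0]
distribution = Counter(sample_predictions)

for label, count in sorted(distribution.items()):
    print(label, count)
```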

## **Mathematical Explanation of XGBoost**


XGBoost (Extreme Gradient Boosting) is an ensemble learning method that has gained popularity for its effectiveness in various machine learning tasks, particularly in structured/tabular data problems. Here's a brief overview of the mathematical background of XGBoost using LaTeX notation:

**Objective Function**

The objective of XGBoost is to minimize a regularized objective function, which can be generally expressed as:

$$L(\phi) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where:

$L(\phi)$ is the objective function to be minimized.

$n$ is the number of training instances.

$l(y_i, \hat{y}_i)$ is the loss function that measures the difference between the true label $y_i$ and the predicted label $\hat{y}_i$.

$K$ is the number of trees (boosting rounds).

$f_k$ represents the $k$-th tree in the ensemble.

$\Omega(f_k)$ is the regularization term that penalizes the complexity of the model.

**Loss Function and Regularization**

The loss function $l(y_i, \hat{y}_i)$ depends on the specific problem (regression, classification, etc.). Common choices include:

**Regression:**

Squared Error: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$

**Binary Classification:**

Logistic Loss: $l(y_i, \hat{y}_i) = \log\left(1 + \exp(-y_i \cdot \hat{y}_i)\right)$

This form assumes labels $y_i \in \{-1, +1\}$ and a raw (pre-sigmoid) score $\hat{y}_i$. With 0/1 labels, as used in this notebook, the equivalent form is the log loss $-\left[y_i \log p_i + (1 - y_i) \log(1 - p_i)\right]$ with $p_i = 1 / (1 + \exp(-\hat{y}_i))$.

**Multiclass Classification:**

Softmax Loss: $l(y_i, \hat{y}_i) = -\sum_{j} y_{ij} \log(\hat{y}_{ij})$, where $y_{ij}$ is 1 if instance $i$ belongs to class $j$ (0 otherwise) and $\hat{y}_{ij}$ is the predicted probability of class $j$.

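The three loss functions listed above can be evaluated directly for a single example. The sketch below uses only the standard library; the function names are illustrative, the logistic version assumes labels in {-1, +1} with a raw score, and the softmax version assumes a one-hot label with predicted class probabilities.

```python
import math

def squared_error(y, y_hat):
    """Squared error for regression."""
    return (y - y_hat) ** 2

def logistic_loss(y, score):
    """Logistic loss for a label y in {-1, +1} and a raw score."""
    return math.log(1.0 + math.exp(-y * score))

def softmax_loss(one_hot, probs):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

print(squared_error(3.0, 2.5))                                # 0.25
print(round(logistic_loss(+1, 0.0), 4))                       # 0.6931
print(round(softmax_loss([0, 1, 0], [0.2, 0.7, 0.1]), 4))     # 0.3567
```

A raw score of 0 gives a logistic loss of log(2), the loss of a model that is maximally unsure between the two classes.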
**Tree Ensemble**

XGBoost builds an ensemble of decision trees sequentially. Each tree $f_k$ is trained to correct the residual errors of the ensemble built so far and is then added to it. The prediction $\hat{y}_i$ is computed as:

$$\hat{y}_i = \phi_0(x_i) + \sum_{k=1}^{K} f_k(x_i)$$

where $\phi_0(x_i)$ is the initial prediction and $f_k(x_i)$ are the predictions from each tree in the ensemble.
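The additive structure of the prediction can be illustrated with a toy boosting loop. This is a simplified sketch, not XGBoost's algorithm: it assumes squared error (so each new tree simply fits the current residuals), one feature, one-split trees (stumps), and at least two distinct feature values. All function names and the data are hypothetical. The learning rate is folded into each stored stump so that the final prediction is exactly the initial prediction plus the sum of the tree outputs.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees=3, learning_rate=0.5):
    base = sum(ys) / len(ys)  # phi_0: constant initial prediction
    trees, preds = [], [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stump = (threshold, learning_rate * left, learning_rate * right)
        trees.append(stump)
        preds = [p + stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, trees

def predict(base, trees, x):
    """y_hat = phi_0 + sum of the outputs of all trees."""
    return base + sum(stump_predict(t, x) for t in trees)

base, trees = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
print(predict(base, trees, 1), predict(base, trees, 4))  # 1.125 2.875
```

Each round halves the remaining error on this toy data, so the predictions move from the initial mean of 2.0 toward the targets 1.0 and 3.0.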

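**Regularization Terms**

The regularization term $\Omega(f_k)$ typically includes:

**Tree Complexity Regularization:**

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert_2^2$$

$T$ is the number of leaves in the tree.

$w$ is the vector of weights associated with the leaves.

$\gamma$ and $\lambda$ are non-negative coefficients that control how strongly the number of leaves and the size of the leaf weights are penalized.
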
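As a worked example, the complexity penalty and the full regularized objective can be computed for made-up numbers. The function names, leaf weights, per-example losses, and coefficient values below are all illustrative assumptions.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum(w_j ** 2), with T = number of leaves."""
    num_leaves = len(leaf_weights)
    l2_penalty = 0.5 * lam * sum(w ** 2 for w in leaf_weights)
    return gamma * num_leaves + l2_penalty

def regularized_objective(per_example_losses, trees, gamma, lam):
    """L = sum of per-example losses + sum of Omega(f_k) over all trees."""
    penalty = sum(tree_complexity(t, gamma, lam) for t in trees)
    return sum(per_example_losses) + penalty

# Hypothetical trees, each described only by its leaf weights
tree_a = [0.5, -1.0, 2.0]
tree_b = [1.0, 1.0]

print(tree_complexity(tree_a, gamma=1.0, lam=2.0))  # 8.25
print(regularized_objective([0.25, 0.5], [tree_a, tree_b], gamma=1.0, lam=2.0))  # 13.0
```

Here the first tree contributes 3 (three leaves) plus 5.25 (weight penalty), so a tree with more leaves or larger weights raises the objective even if the loss is unchanged.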
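**Sparsity Regularization:**

An optional L1 penalty on the leaf weights, $\alpha \lVert w \rVert_1$ (the alpha parameter in XGBoost), encourages sparse solutions.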