### Big_Data_Project in AWS Glue Environment


# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


In [19]:
%help


# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0 and 3.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X and G.2X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



####  Run this cell to set up and start your interactive session.


In [2]:
%idle_timeout 20
%glue_version 3.0
%worker_type G.1X
%number_of_workers 2

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

You are already connected to a glueetl session de5500ae-409e-406e-a3f9-7b4cd0b83ff3.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Current idle_timeout is 20 minutes.
idle_timeout has been set to 20 minutes.


You are already connected to a glueetl session de5500ae-409e-406e-a3f9-7b4cd0b83ff3.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Setting Glue version to: 3.0


You are already connected to a glueetl session de5500ae-409e-406e-a3f9-7b4cd0b83ff3.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Previous worker type: G.1X
Setting new worker type to: G.1X


You are already connected to a glueetl session de5500ae-409e-406e-a3f9-7b4cd0b83ff3.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Previous number of workers: 2
Setting new number of workers to: 2



In [3]:
from awsglue.dynamicframe import DynamicFrame




#### Example: Create a DynamicFrame from CSV files in Amazon S3 and display its schema


In [4]:
dyf = glueContext.create_dynamic_frame.from_options( 's3',  {'paths': ['s3://myoyebucket/input/']}, 'csv', {'withHeader': True})
dyf.printSchema()

root
|-- : string
|-- Date: string
|-- Text: string
|-- NumberOfcounts: string
|-- NumberofLike:: string
|-- NumofRetweet: string
|-- user: string
|-- Tinubu: string
|-- Peter_Obi: string
|-- PDP: string
|-- APC: string
|-- LP: string
|-- Atiku: string
|-- reprocessed: string
|-- reprocessed text: string
|-- compound: string
|-- postive: string
|-- negative: string
|-- nuetral: string
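
Every column in the schema above is typed `string`, because CSV files carry no type information. Numeric columns such as `compound` therefore need a cast before any arithmetic or numeric comparison. The sketch below reproduces the effect with the standard library only. The two sample rows and the `to_float` helper are illustrative assumptions; the column names follow the schema, including its original spellings.

```python
import csv
import io

# Hypothetical sample with a few of the columns from the schema above.
sample = io.StringIO(
    "compound,postive,negative,nuetral\n"
    "0.2263,0.24,0.182,0.578\n"
    "-0.7203,0.221,0.381,0.397\n"
)

rows = list(csv.DictReader(sample))

# A CSV reader returns every field as text, just like the all-string schema.
print(type(rows[0]["compound"]).__name__)  # str


def to_float(value):
    """Cast a CSV field to float, returning None for blank or malformed text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


typed_rows = [{name: to_float(value) for name, value in row.items()} for row in rows]
print(typed_rows[1]["compound"])  # -0.7203
```

Returning `None` for unparseable text mirrors what a Spark `cast('float')` does with bad values (it yields null), which is why the notebook later drops nulls after casting.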


#### Example: Convert the DynamicFrame to a Spark DataFrame and display a sample of the data


In [5]:
df = dyf.toDF()
df.show()

+---+--------------------+--------------------+--------------+-------------+------------+--------------------+------+---------+---+---+---+-----+--------------------+--------------------+--------+-------+--------+-------+
|   |                Date|                Text|NumberOfcounts|NumberofLike:|NumofRetweet|                user|Tinubu|Peter_Obi|PDP|APC| LP|Atiku|         reprocessed|    reprocessed text|compound|postive|negative|nuetral|
+---+--------------------+--------------------+--------------+-------------+------------+--------------------+------+---------+---+---+---+-----+--------------------+--------------------+--------+-------+--------+-------+
|  0|2023-03-09 10:45:...|@Observer_kay @Du...|             0|            1|           0|https://twitter.c...|      |         |   |   |   |     |observer_kay duke...|observer_kay duke...|     0.0|    0.0|     0.0|    1.0|
|  1|2023-03-09 10:45:...|@B0lutife @oluwas...|             4|            3|           1|https://twitter.c...|  

#### Example: Filter the DataFrame to the rows that mention Peter Obi


In [6]:
output_df = df.filter(df['Peter_Obi']=='Peter Obi')
output_df.show()

+---+--------------------+--------------------+--------------+-------------+------------+--------------------+------+---------+---+---+---+-----+--------------------+--------------------+--------+-------+--------+-------+
|   |                Date|                Text|NumberOfcounts|NumberofLike:|NumofRetweet|                user|Tinubu|Peter_Obi|PDP|APC| LP|Atiku|         reprocessed|    reprocessed text|compound|postive|negative|nuetral|
+---+--------------------+--------------------+--------------+-------------+------------+--------------------+------+---------+---+---+---+-----+--------------------+--------------------+--------+-------+--------+-------+
|  1|2023-03-09 10:45:...|@B0lutife @oluwas...|             4|            3|           1|https://twitter.c...|      |Peter Obi|   |   |   |     |b0lutife oluwaseg...|b0lutife oluwaseg...|     0.0|    0.0|     0.0|    1.0|
|  2|2023-03-09 10:44:...|@Nwalove3 @Lotus_...|             0|            0|           0|https://twitter.c...|  

In [7]:
#Convert from Spark Data Frame to Glue Dynamic Frame
dyfCustomersConvert = DynamicFrame.fromDF(output_df, glueContext, "convert")

#Show converted Glue Dynamic Frame
dyfCustomersConvert.show()

{"": "1", "Date": "2023-03-09 10:45:00+00:00", "Text": "@B0lutife @oluwasegunadeb The election of 18th March 2023, will throw up so many realignment. They are under the illusions that it was only their votes that gave Peter Obi victory during the presidential elections. Can they try what they are trying in the SW in the North?", "NumberOfcounts": "4", "NumberofLike:": "3", "NumofRetweet": "1", "user": "https://twitter.com/ctpnetwo", "Tinubu": "", "Peter_Obi": "Peter Obi", "PDP": "", "APC": "", "LP": "", "Atiku": "", "reprocessed": "b0lutife oluwasegunadeb election 18th march 2023 , throw many realignment . illusion vote gave peter obi victory presidential election . try trying sw north ?", "reprocessed text": "b0lutife oluwasegunadeb election 18th march 2023 , throw many realignment . illusion vote gave peter obi victory presidential election . try trying sw north ?", "compound": "0.0", "postive": "0.0", "negative": "0.0", "nuetral": "1.0"}
{"": "2", "Date": "2023-03-09 10:44:52+00:00"

In [8]:
output_df=output_df.select(['reprocessed text','NumberOfcounts','NumberofLike:','compound','negative','postive','nuetral'])
output_df.show()

+--------------------+--------------+-------------+--------+--------+-------+-------+
|    reprocessed text|NumberOfcounts|NumberofLike:|compound|negative|postive|nuetral|
+--------------------+--------------+-------------+--------+--------+-------+-------+
|b0lutife oluwaseg...|             4|            3|     0.0|     0.0|    0.0|    1.0|
|nwalove3 lotus_fl...|             0|            0|     0.0|     0.0|    0.0|    1.0|
|renoomokri campai...|             0|            0|  0.2263|   0.182|   0.24|  0.578|
|glanced girl medi...|             5|           42|    -0.0|   0.174|  0.198|  0.628|
|shehusani sir fel...|             0|            0|  0.6124|     0.0|  0.263|  0.737|
|cbngov_akin1 ayoo...|             0|            2|  0.1027|   0.201|  0.211|  0.588|
|didynne ibksports...|             0|            0|     0.0|     0.0|    0.0|    1.0|
|chimaobi_nteoma l...|             0|            5|  0.4939|     0.0|  0.302|  0.698|
|dailypostngr may ...|             0|            0| -0

In [9]:
output_df=output_df.drop('NumberOfcounts','NumberofLike:')
output_df.show()

+--------------------+--------+--------+-------+-------+
|    reprocessed text|compound|negative|postive|nuetral|
+--------------------+--------+--------+-------+-------+
|b0lutife oluwaseg...|     0.0|     0.0|    0.0|    1.0|
|nwalove3 lotus_fl...|     0.0|     0.0|    0.0|    1.0|
|renoomokri campai...|  0.2263|   0.182|   0.24|  0.578|
|glanced girl medi...|    -0.0|   0.174|  0.198|  0.628|
|shehusani sir fel...|  0.6124|     0.0|  0.263|  0.737|
|cbngov_akin1 ayoo...|  0.1027|   0.201|  0.211|  0.588|
|didynne ibksports...|     0.0|     0.0|    0.0|    1.0|
|chimaobi_nteoma l...|  0.4939|     0.0|  0.302|  0.698|
|dailypostngr may ...| -0.7203|   0.381|  0.221|  0.397|
|emmy4life02 chiji...|  0.7711|   0.056|  0.232|  0.712|
|obi coming ! 's p...|  0.4389|   0.093|  0.203|  0.704|
|peter obi support...|  0.6369|     0.0|  0.321|  0.679|
|okada rider feder...|  0.8718|     0.0|  0.287|  0.713|
|lmao : face_with_...|  0.2003|   0.205|  0.184|  0.612|
|drowsyrebel state...|  0.3054|

In [10]:
output_df=output_df.filter('compound!=0.0').select('reprocessed text','compound')
output_df.show()

+--------------------+--------+
|    reprocessed text|compound|
+--------------------+--------+
|renoomokri campai...|  0.2263|
|shehusani sir fel...|  0.6124|
|cbngov_akin1 ayoo...|  0.1027|
|chimaobi_nteoma l...|  0.4939|
|dailypostngr may ...| -0.7203|
|emmy4life02 chiji...|  0.7711|
|obi coming ! 's p...|  0.4389|
|peter obi support...|  0.6369|
|okada rider feder...|  0.8718|
|lmao : face_with_...|  0.2003|
|drowsyrebel state...|  0.3054|
|fbcruze annie_xox...|   0.296|
|election , sharin...|   0.636|
|structure 6 senat...|  0.4382|
|drama tinubu arra...|  0.1531|
|e peter obi never...|  0.2023|
|renoomokri peter ...|  0.3612|
|avoid tempering d...|  -0.296|
|renoomokri sir ob...|  0.2732|
|i_am_deomoluabi t...|  -0.784|
+--------------------+--------+
only showing top 20 rows


In [11]:
output_df.count()

4247


In [12]:
from pyspark.sql.functions import when
from pyspark.sql.functions import col




In [13]:
output_df=output_df.withColumn('compound', col('compound').cast('float'))
output_df=output_df.na.drop()




In [14]:
output_df=output_df.withColumn('sentiment', when(output_df.compound > 0, 1).otherwise(0))
output_df.show()

+--------------------+--------+---------+
|    reprocessed text|compound|sentiment|
+--------------------+--------+---------+
|renoomokri campai...|  0.2263|        1|
|shehusani sir fel...|  0.6124|        1|
|cbngov_akin1 ayoo...|  0.1027|        1|
|chimaobi_nteoma l...|  0.4939|        1|
|dailypostngr may ...| -0.7203|        0|
|emmy4life02 chiji...|  0.7711|        1|
|obi coming ! 's p...|  0.4389|        1|
|peter obi support...|  0.6369|        1|
|okada rider feder...|  0.8718|        1|
|lmao : face_with_...|  0.2003|        1|
|drowsyrebel state...|  0.3054|        1|
|fbcruze annie_xox...|   0.296|        1|
|election , sharin...|   0.636|        1|
|structure 6 senat...|  0.4382|        1|
|drama tinubu arra...|  0.1531|        1|
|e peter obi never...|  0.2023|        1|
|renoomokri peter ...|  0.3612|        1|
|avoid tempering d...|  -0.296|        0|
|renoomokri sir ob...|  0.2732|        1|
|i_am_deomoluabi t...|  -0.784|        0|
+--------------------+--------+---
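
The labelling rule in this cell can be stated in plain Python. This is a minimal sketch of the same rule, not the Spark expression itself: a strictly positive `compound` score becomes 1 and everything else becomes 0. Because rows with a compound score of exactly 0.0 were filtered out in In [10], a label of 0 here effectively means a negative score. The function name `label_sentiment` is hypothetical.

```python
def label_sentiment(compound):
    """Return 1 for a strictly positive compound score, otherwise 0."""
    return 1 if compound > 0 else 0


# Compound scores copied from the table above.
scores = [0.2263, 0.6124, -0.7203, -0.296, -0.784]
labels = [label_sentiment(score) for score in scores]
print(labels)  # [1, 1, 0, 0, 0]
```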

In [15]:
output_df=output_df.drop('compound')
output_df.show()

+--------------------+---------+
|    reprocessed text|sentiment|
+--------------------+---------+
|renoomokri campai...|        1|
|shehusani sir fel...|        1|
|cbngov_akin1 ayoo...|        1|
|chimaobi_nteoma l...|        1|
|dailypostngr may ...|        0|
|emmy4life02 chiji...|        1|
|obi coming ! 's p...|        1|
|peter obi support...|        1|
|okada rider feder...|        1|
|lmao : face_with_...|        1|
|drowsyrebel state...|        1|
|fbcruze annie_xox...|        1|
|election , sharin...|        1|
|structure 6 senat...|        1|
|drama tinubu arra...|        1|
|e peter obi never...|        1|
|renoomokri peter ...|        1|
|avoid tempering d...|        0|
|renoomokri sir ob...|        1|
|i_am_deomoluabi t...|        0|
+--------------------+---------+
only showing top 20 rows


In [16]:
#Convert from Spark Data Frame to Glue Dynamic Frame
dyfCustomersConvert = DynamicFrame.fromDF(output_df, glueContext, "convert")

#Show converted Glue Dynamic Frame
dyfCustomersConvert.show()

{"reprocessed text": "renoomokri campaign election yet obi still living rent-free head . brother winning election hor senate attack peter obi . sure everything still ok . see see madness", "sentiment": 1}
{"reprocessed text": "shehusani sir fellowing tweet 's lately truth 's ( peter obi ) people wish normal really sane country", "sentiment": 1}
{"reprocessed text": "cbngov_akin1 ayooyalowo woye1 fkeyamo realffk officialabat officialdssng equityoyo instablog9ja mr_jags journalist_mind lol . cornfused apc miscreant cease amaze u . someone obviously s , foolishly disguising igbo man attack peter obi . try , see truth lie .", "sentiment": 1}
{"reprocessed text": "chimaobi_nteoma leonard50310709 markessien apc guy keep playing people . learn ? allow peter obi legal team job", "sentiment": 1}
{"reprocessed text": "dailypostngr may god forgive . 're deception personified . atiku n't rig abi ? know peter obi . incapable fraud !", "sentiment": 0}
{"reprocessed text": "emmy4life02 chijioke_edeog

In [17]:
glueContext.write_dynamic_frame.from_options(frame = dyfCustomersConvert,
                                             connection_type = "s3",
                                             connection_options = {"path": "s3://myoyebucket/output"},
                                             format = "csv")

<awsglue.dynamicframe.DynamicFrame object at 0x7ff581f2dc50>


In [18]:
data=output_df.toPandas()
data.head(5)

                                    reprocessed text  sentiment
0  renoomokri campaign election yet obi still liv...          1
1  shehusani sir fellowing tweet 's lately truth ...          1
2  cbngov_akin1 ayooyalowo woye1 fkeyamo realffk ...          1
3  chimaobi_nteoma leonard50310709 markessien apc...          1
4  dailypostngr may god forgive . 're deception p...          0


In [19]:
def remove_punc(text):
    # Drop the listed punctuation marks; every other character, including spaces, is kept
    final=''.join(u for u in text if u not in ('?',',','.',':','!','"'))
    return final
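
An equivalent way to write `remove_punc` is with `str.translate`, which deletes the characters in a single pass. The sketch below is an alternative under the assumption that the same six characters should be removed; `remove_punc_translate` and `collapse_spaces` are hypothetical names. Since the removal leaves the surrounding spaces in place, the optional second helper squeezes the resulting runs of whitespace.

```python
import re

# Same six characters that remove_punc drops.
PUNCTUATION = '?,.:!"'
_DROP_TABLE = str.maketrans("", "", PUNCTUATION)


def remove_punc_translate(text):
    """Delete the listed punctuation characters in a single pass."""
    return text.translate(_DROP_TABLE)


def collapse_spaces(text):
    """Optional extra step: squeeze runs of whitespace left behind by the removal."""
    return re.sub(r"\s+", " ", text).strip()


sample = "dailypostngr may god forgive . 're deception personified ."
cleaned = remove_punc_translate(sample)
print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```

Collapsing whitespace does not change the downstream token counts, since the vectorizer splits on word boundaries; it only makes the intermediate text easier to read.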




In [20]:
data['final text']=data['reprocessed text'].apply(remove_punc)
data.head(5)

                                    reprocessed text  ...                                         final text
0  renoomokri campaign election yet obi still liv...  ...  renoomokri campaign election yet obi still liv...
1  shehusani sir fellowing tweet 's lately truth ...  ...  shehusani sir fellowing tweet 's lately truth ...
2  cbngov_akin1 ayooyalowo woye1 fkeyamo realffk ...  ...  cbngov_akin1 ayooyalowo woye1 fkeyamo realffk ...
3  chimaobi_nteoma leonard50310709 markessien apc...  ...  chimaobi_nteoma leonard50310709 markessien apc...
4  dailypostngr may god forgive . 're deception p...  ...  dailypostngr may god forgive  're deception pe...

[5 rows x 3 columns]


In [21]:
data.drop('reprocessed text',axis=1,inplace =True)
data.head(5)

   sentiment                                         final text
0          1  renoomokri campaign election yet obi still liv...
1          1  shehusani sir fellowing tweet 's lately truth ...
2          1  cbngov_akin1 ayooyalowo woye1 fkeyamo realffk ...
3          1  chimaobi_nteoma leonard50310709 markessien apc...
4          0  dailypostngr may god forgive  're deception pe...


In [22]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc,confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer





In [23]:
y=data['sentiment']
X=data['final text']




In [24]:
vectorizer = CountVectorizer()




In [25]:
vectorizer.fit(X)

X=vectorizer.transform(X)
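
To make the result of `fit` and `transform` concrete, here is a simplified pure-Python sketch of a bag-of-words count matrix. It follows `CountVectorizer`'s documented defaults (lowercasing, tokens of two or more word characters, vocabulary in sorted order) but returns dense lists rather than a sparse matrix. The helper names and the two sample documents are assumptions made for illustration.

```python
import re
from collections import Counter

# CountVectorizer's documented defaults: lowercase the text, then keep
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Return one dense row of token counts per document."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


docs = ["peter obi support peter", "election obi"]  # made-up examples
vocab = fit_vocabulary(docs)
print(vocab)                   # {'election': 0, 'obi': 1, 'peter': 2, 'support': 3}
print(transform(docs, vocab))  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Each row is one tweet and each column is one vocabulary word, so the matrix `X` in the notebook has one row per tweet and one column per distinct token.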




In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
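
The size of the held-out set follows from simple arithmetic: with a fractional `test_size`, scikit-learn rounds the test count up and gives the remainder to training. The sketch below illustrates only that sizing with a seeded shuffle from the standard library; it does not reproduce scikit-learn's actual row assignment. It assumes the 4247 rows reported by `output_df.count()` all survived the later `na.drop()`, and `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and hold out a test fraction, rounding the test size up."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(4247)  # 4247 rows, from output_df.count()
print(len(train_idx), len(test_idx))  # 3397 850
```

A test set of 850 rows agrees with the total of the four confusion-matrix counts at the end of the notebook (179 + 76 + 115 + 480 = 850).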




In [27]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Rebuild the features from the raw text: X from In [25] is already a sparse
# count matrix, and CountVectorizer expects an iterable of strings.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['final text'])
y = data['sentiment']  # binary labels

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model; it is named `model` because In [29]
# calls model.predict, and the settings match the repr printed below
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(X_train, y_train)

LogisticRegression(random_state=42, solver='liblinear')


In [29]:
y_pred = model.predict(X_test)  # make predictions on test data
acc = accuracy_score(y_test, y_pred)  # calculate accuracy score
prec = precision_score(y_test, y_pred)  # calculate precision score
rec = recall_score(y_test, y_pred)  # calculate recall score
f1 = f1_score(y_test, y_pred)  # calculate F1 score
fpr, tpr, thresholds = roc_curve(y_test, y_pred)  # calculate ROC curve and AUC score
auc_score = auc(fpr, tpr)
print('Accuracy:',acc)
print('Precision score:',prec)

Accuracy: 0.7752941176470588
Precision score: 0.8067226890756303


In [30]:
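# scikit-learn expects the true labels first: confusion_matrix(y_true, y_pred),
# so rows are true labels and columns are predicted labels
confusion_matrix(y_test, y_pred)

array([[179, 115],
       [ 76, 480]])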
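
The printed accuracy and precision can be recovered from the four confusion-matrix counts, which also give the recall and F1 values that the notebook computes but never prints. The sketch below reads the counts so that precision matches the printed 0.8067: 480 true positives, 115 false positives, 76 false negatives and 179 true negatives, with sentiment 1 as the positive class. The last figure is an assumption-laden extra: because `roc_curve` was given hard 0/1 predictions rather than probabilities, the curve has a single interior point and its trapezoidal area reduces to the mean of the true-positive and true-negative rates.

```python
# Counts read from the confusion matrix above (positive class = sentiment 1).
tp, fp, fn, tn = 480, 115, 76, 179

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point,
# so the trapezoidal area reduces to (TPR + TNR) / 2.
tpr = recall
tnr = tn / (tn + fp)
auc_from_labels = (tpr + tnr) / 2

print(f"accuracy  {accuracy:.4f}")   # 0.7753
print(f"precision {precision:.4f}")  # 0.8067
print(f"recall    {recall:.4f}")
print(f"f1        {f1:.4f}")
print(f"auc       {auc_from_labels:.4f}")
```

Passing `model.predict_proba(X_test)[:, 1]` to `roc_curve` instead of the hard labels would trace the full curve and usually gives a more informative AUC.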
