# US Airline Tweets Sentiment Analysis

We are going to analyze how travelers expressed their feelings about US airlines on Twitter in February 2015.

Dataset download link: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

We will use Spark's machine learning utilities, together with data preprocessing and feature engineering, to classify tweets as describing a good or bad flight experience.



In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession 
from pyspark.ml.clustering import KMeans



Now let's create a Spark session.

In [2]:
spark = SparkSession.builder.appName('Classification of USA air line flight experience tweets').getOrCreate()

## Read dataset 

In [9]:
df = spark.read.csv('/Users/ihebd/PycharmProjects/Spark-with-machine-learning-/datasets/Tweets.csv', header=True, inferSchema=True)

Before applying any machine learning, we need to explore the dataset, remove duplicated rows, and replace empty or N/A cells with a value.

In [10]:
df.head()

Row(tweet_id='570306133677760513', airline_sentiment='neutral', airline_sentiment_confidence='1.0', negativereason=None, negativereason_confidence=None, airline='Virgin America', airline_sentiment_gold=None, name='cairdin', negativereason_gold=None, retweet_count=0, text='@VirginAmerica What @dhepburn said.', tweet_coord=None, tweet_created='2015-02-24 11:35:52 -0800', tweet_location=None, user_timezone='Eastern Time (US & Canada)')

In [11]:
df.printSchema()

root
 |-- tweet_id: string (nullable = true)
 |-- airline_sentiment: string (nullable = true)
 |-- airline_sentiment_confidence: string (nullable = true)
 |-- negativereason: string (nullable = true)
 |-- negativereason_confidence: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- airline_sentiment_gold: string (nullable = true)
 |-- name: string (nullable = true)
 |-- negativereason_gold: string (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- tweet_coord: string (nullable = true)
 |-- tweet_created: string (nullable = true)
 |-- tweet_location: string (nullable = true)
 |-- user_timezone: string (nullable = true)



In [12]:
df.describe()

DataFrame[summary: string, tweet_id: string, airline_sentiment: string, airline_sentiment_confidence: string, negativereason: string, negativereason_confidence: string, airline: string, airline_sentiment_gold: string, name: string, negativereason_gold: string, retweet_count: string, text: string, tweet_coord: string, tweet_created: string, tweet_location: string, user_timezone: string]

Count the NaN values in each column (`isnan` flags NaN only, not null):

In [17]:
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()


+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|airline|airline_sentiment_gold|name|negativereason_gold|retweet_count|text|tweet_coord|tweet_created|tweet_location|user_timezone|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|       0|                0|                           0|             0|                        0|      0|                     0|   0|                  0|            0|   0|          0|            0|             0|            0|
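+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+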

Count the null values in each column:

In [18]:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|airline|airline_sentiment_gold|name|negativereason_gold|retweet_count|text|tweet_coord|tweet_created|tweet_location|user_timezone|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|       0|              155|                          68|          5573|                     4229|    179|                 14788| 196|              14805|          205| 205|      13768|          389|          5010|         5103|
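+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+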

In [19]:
df.select('airline_sentiment_gold').show()

+----------------------+
|airline_sentiment_gold|
+----------------------+
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
+----------------------+
only showing top 20 rows



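This column appears to be almost entirely empty: the null counts above show 14,788 nulls in `airline_sentiment_gold` and 14,805 in `negativereason_gold`. We will therefore drop both columns.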

In [23]:
df = df.drop('airline_sentiment_gold', 'negativereason_gold')
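Looking at the first 20 rows does not prove a column is empty, so the decision can also be made from the null counts. The helper below is a minimal sketch of that idea in plain Python: `mostly_null_columns`, the 0.9 threshold, and the example counts are all hypothetical. In practice, `null_counts` could come from the null-count query via `.first().asDict()` and `total_rows` from `df.count()`.

```python
def mostly_null_columns(null_counts, total_rows, threshold=0.9):
    """Return the column names whose null fraction exceeds `threshold`."""
    if total_rows <= 0:
        return []
    return [
        name
        for name, n_null in null_counts.items()
        if n_null / total_rows > threshold
    ]

# Hypothetical counts for a 1,000-row DataFrame (not the real dataset).
example_counts = {"tweet_id": 0, "text": 12, "airline_sentiment_gold": 997}

to_drop = mostly_null_columns(example_counts, total_rows=1000)
print(to_drop)
```

The resulting list could then be passed to Spark as `df.drop(*to_drop)`, which keeps the drop rule explicit and repeatable.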

In [25]:
df.printSchema()

root
 |-- tweet_id: string (nullable = true)
 |-- airline_sentiment: string (nullable = true)
 |-- airline_sentiment_confidence: string (nullable = true)
 |-- negativereason: string (nullable = true)
 |-- negativereason_confidence: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- name: string (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- tweet_coord: string (nullable = true)
 |-- tweet_created: string (nullable = true)
 |-- tweet_location: string (nullable = true)
 |-- user_timezone: string (nullable = true)



In [27]:
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+--------+-----------------+----------------------------+--------------+-------------------------+-------+----+-------------+----+-----------+-------------+--------------+-------------+
|tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|airline|name|retweet_count|text|tweet_coord|tweet_created|tweet_location|user_timezone|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----+-------------+----+-----------+-------------+--------------+-------------+
|       0|              155|                          68|          5573|                     4229|    179| 196|          205| 205|      13768|          389|          5010|         5103|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----+-------------+----+-----------+-------------+--------------+-------------+

