# US Airline Tweets Sentiment Analysis

We are going to analyze how travelers in February 2015 expressed their feelings on Twitter.

Dataset download link: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

We will use Spark's machine learning utilities, together with feature engineering and data preprocessing, to classify tweets as describing a bad or a good experience.



In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession 
from pyspark.ml.clustering import KMeans



Now let's create a Spark session.

In [2]:
spark = SparkSession.builder.appName('Classification of USA air line flight experience tweets').getOrCreate()

### Read dataset 

In [3]:
df = spark.read.csv('/Users/ihebd/PycharmProjects/Spark-with-machine-learning-/datasets/Tweets.csv', header=True, inferSchema=True)

Before applying any machine learning, we need to explore the dataset, remove duplicated rows, and replace empty or N/A cells with a value.

In [5]:
df.head(5)

[Row(tweet_id='570306133677760513', airline_sentiment='neutral', airline_sentiment_confidence='1.0', negativereason=None, negativereason_confidence=None, airline='Virgin America', airline_sentiment_gold=None, name='cairdin', negativereason_gold=None, retweet_count=0, text='@VirginAmerica What @dhepburn said.', tweet_coord=None, tweet_created='2015-02-24 11:35:52 -0800', tweet_location=None, user_timezone='Eastern Time (US & Canada)'),
 Row(tweet_id='570301130888122368', airline_sentiment='positive', airline_sentiment_confidence='0.3486', negativereason=None, negativereason_confidence='0.0', airline='Virgin America', airline_sentiment_gold=None, name='jnardino', negativereason_gold=None, retweet_count=0, text="@VirginAmerica plus you've added commercials to the experience... tacky.", tweet_coord=None, tweet_created='2015-02-24 11:15:59 -0800', tweet_location=None, user_timezone='Pacific Time (US & Canada)'),
 Row(tweet_id='570301083672813571', airline_sentiment='neutral', airline_sentim

In [6]:
df.printSchema()

root
 |-- tweet_id: string (nullable = true)
 |-- airline_sentiment: string (nullable = true)
 |-- airline_sentiment_confidence: string (nullable = true)
 |-- negativereason: string (nullable = true)
 |-- negativereason_confidence: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- airline_sentiment_gold: string (nullable = true)
 |-- name: string (nullable = true)
 |-- negativereason_gold: string (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- tweet_coord: string (nullable = true)
 |-- tweet_created: string (nullable = true)
 |-- tweet_location: string (nullable = true)
 |-- user_timezone: string (nullable = true)



### Dealing with categorical variables

Before moving forward, we need to deal with the categorical variables in our dataset.

In [17]:
df.groupBy('airline_sentiment').count().show()

+--------------------+-----+
|   airline_sentiment|count|
+--------------------+-----+
|[35.23185283, -80...|    1|
|   ubetter do smth!"|    1|
|            positive| 2363|
| we had a good ru...|    1|
| never submits. F...|    1|
|[40.7740308, -73....|    1|
|                   0|    8|
|     please????????"|    1|
|                null|  155|
| this is where Ce...|    1|
| flight AA1469 2/...|    1|
|[51.44284934, -0....|    1|
| or just days tha...|    1|
|[40.65062011, -73...|    1|
|             neutral| 3099|
| this is where Ce...|    2|
|            negative| 9178|
| and I might choo...|    1|
|            Virginia|    1|
|          [0.0, 0.0]|    1|
+--------------------+-----+
only showing top 20 rows
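
Several of the `airline_sentiment` values above look like fragments of tweet text or coordinates rather than sentiment labels. One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that some tweets contain line breaks inside quoted fields, so a reader that treats every physical line as a record shifts the remaining columns. The sketch below reproduces that effect on a tiny made-up CSV (not rows from the real dataset), using only Python's standard `csv` module.

```python
import csv
import io

# Made-up sample: the text field of the second record contains a line break
# inside the quotes, as tweets sometimes do.
raw = (
    'tweet_id,airline_sentiment,text\n'
    '1,positive,"great flight"\n'
    '2,negative,"lost my bag\n'
    'so, please help"\n'
)

# Naive approach: one physical line = one record, split on every comma.
naive_rows = [line.split(',') for line in raw.splitlines()[1:]]
naive_labels = [row[1] if len(row) > 1 else None for row in naive_rows]

# Quote-aware approach: csv.reader keeps the quoted line break inside one field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]
parsed_labels = [row[1] for row in parsed_rows]

print(naive_labels)   # a text fragment shows up as a third "label"
print(parsed_labels)  # only the two real labels remain
```

If this is indeed the cause, Spark's CSV reader has `multiLine` and `escape` options that are worth trying when reading `Tweets.csv`, for example `multiLine=True, escape='"'`. Whether they fully remove the stray rows in this dataset has not been verified here.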



In [27]:
df.groupBy('negativereason').count().show()

+--------------------+-----+
|      negativereason|count|
+--------------------+-----+
|2015-02-21 14:57:...|    1|
|        Florida, USA|    2|
|   Dallas Fort Worth|    1|
|Toronto (formerly...|    1|
|ÜT: 42.798909,-71...|    1|
|        Lost Luggage|  724|
|Blue Ridge High S...|    2|
|      New York City |    1|
|           longlines|  178|
|  bonkers in Yonkers|    1|
|        Columbus, OH|    1|
|         Late Flight| 1665|
|     Pocos de Caldas|    1|
| rented van drove...|    1|
|         Los Angeles|    1|
|2015-02-20 08:53:...|    1|
|freyabevanfund@ho...|    2|
|ÜT: 34.078171,-11...|    1|
|      Louisville, KY|    1|
|              ottawa|    1|
+--------------------+-----+
only showing top 20 rows



### Dealing with missing/null values

Count the NaN values in each column of the dataset:

In [9]:
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()


+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|airline|airline_sentiment_gold|name|negativereason_gold|retweet_count|text|tweet_coord|tweet_created|tweet_location|user_timezone|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|       0|                0|                           0|             0|                        0|      0|                     0|   0|                  0|            0|   0|          0|            0|             0|            0|
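+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+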

Count the null values in each column of the dataset:

In [14]:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|airline|airline_sentiment_gold|name|negativereason_gold|retweet_count|text|tweet_coord|tweet_created|tweet_location|user_timezone|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+
|       0|              155|                          68|          5573|                     4229|    179|                 14788| 196|              14805|          205| 205|      13768|          389|          5010|         5103|
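+--------+-----------------+----------------------------+--------------+-------------------------+-------+----------------------+----+-------------------+-------------+----+-----------+-------------+--------------+-------------+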
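
The `isnan` check reports zero everywhere while the `isNull` check finds thousands of hits, because the two tests look for different things: NaN is a special floating-point value, whereas null means there is no value at all. A plain-Python analogy (made-up values, not Spark code), with `None` playing the role of null:

```python
import math

# Made-up column: two missing entries (None) and one NaN.
values = [1.0, None, float("nan"), None, 0.3486]

# NaN is a float value, so it is only tested on floats.
nan_count = sum(1 for v in values if isinstance(v, float) and math.isnan(v))

# None stands in for a null cell: there is no value to test at all.
null_count = sum(1 for v in values if v is None)

print(nan_count, null_count)
```

For this dataset, the null counts are therefore the ones that matter when deciding how to handle missing data.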

In [15]:
df.select('airline_sentiment_gold').show()

+----------------------+
|airline_sentiment_gold|
+----------------------+
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
|                  null|
+----------------------+
only showing top 20 rows



### Dealing with empty columns

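The `airline_sentiment_gold` column is almost entirely null (14788 null values), and the same holds for `negativereason_gold` (14805 null values), so we will drop both.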
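
The null counts printed earlier can make this choice explicit. A minimal sketch in plain Python, using the counts copied from that output and a hypothetical cutoff of 14000 nulls (the cutoff is an assumption, chosen only to separate the two `_gold` columns from the rest):

```python
# Null counts copied from the per-column null count output above.
null_counts = {
    "tweet_id": 0,
    "airline_sentiment": 155,
    "airline_sentiment_confidence": 68,
    "negativereason": 5573,
    "negativereason_confidence": 4229,
    "airline": 179,
    "airline_sentiment_gold": 14788,
    "name": 196,
    "negativereason_gold": 14805,
    "retweet_count": 205,
    "text": 205,
    "tweet_coord": 13768,
    "tweet_created": 389,
    "tweet_location": 5010,
    "user_timezone": 5103,
}

# Hypothetical cutoff: columns with more nulls than this are drop candidates.
NULL_CUTOFF = 14000

columns_to_drop = [name for name, n in null_counts.items() if n > NULL_CUTOFF]
print(columns_to_drop)
```

Note that `tweet_coord` (13768 nulls) falls just under this cutoff; a lower cutoff would flag it as well.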

In [23]:
# Drop the two columns that are almost entirely null.
df = df.drop('airline_sentiment_gold')
df = df.drop('negativereason_gold')

In [25]:
df.printSchema()

root
 |-- tweet_id: string (nullable = true)
 |-- airline_sentiment: string (nullable = true)
 |-- airline_sentiment_confidence: string (nullable = true)
 |-- negativereason: string (nullable = true)
 |-- negativereason_confidence: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- name: string (nullable = true)
 |-- retweet_count: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- tweet_coord: string (nullable = true)
 |-- tweet_created: string (nullable = true)
 |-- tweet_location: string (nullable = true)
 |-- user_timezone: string (nullable = true)



In [27]:
# Recount nulls per column after dropping the two columns.
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+--------+-----------------+----------------------------+--------------+-------------------------+-------+----+-------------+----+-----------+-------------+--------------+-------------+
|tweet_id|airline_sentiment|airline_sentiment_confidence|negativereason|negativereason_confidence|airline|name|retweet_count|text|tweet_coord|tweet_created|tweet_location|user_timezone|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----+-------------+----+-----------+-------------+--------------+-------------+
|       0|              155|                          68|          5573|                     4229|    179| 196|          205| 205|      13768|          389|          5010|         5103|
+--------+-----------------+----------------------------+--------------+-------------------------+-------+----+-------------+----+-----------+-------------+--------------+-------------+

