#  **Flights_Analysis_in_2015** 
Mengran Tang (December 2017)
___

This notebook analyzes the **flights distribution** across the US in 2015 for the major airlines. It then develops a **LightGBM regression model** to predict flight delays, after analyzing them, in order to give advice on reducing delays. Finally, it tackles a **natural language processing problem**, using a recurrent neural network for airline sentiment analysis based on Twitter comments.

____
From a **_technical point of view_**, the main aspects of Python covered throughout the notebook are:
- **visualization**: matplotlib, seaborn, plotly
- **data manipulation**: pandas, numpy
- **modeling**: keras, LightGBM
- **problem**: regression, natural language processing
___
Two datasets are involved in this notebook. First, the flight delay data was collected and published by the DOT's Bureau of Transportation Statistics, which tracks the on-time performance of domestic flights operated by large air carriers. Second, the Twitter sentiment data comes from Crowdflower's Data for Everyone library. It was scraped in February 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize the negative reasons (such as "late flight" or "rude service").

Plotly is the main tool for data visualization, since it gives good-looking pictures as well as interactivity. Clicking and dragging on a plotly graph shows detailed information. LightGBM is generally reported to be fast while staying competitive in accuracy with other methods such as XGBoost, so I think it is a good fit for this regression problem. Dealing with the Twitter data requires natural language processing. Because RNNs are widely used in NLP tasks such as translation, using keras to implement an RNN model for this sentiment analysis problem should work well. <br>
___

# Table of Contents
This notebook is composed of three parts: Flights and Airlines data visualization (section 1), Delay Analysis and Predicting (section 2) and Twitter Sentiment Analysis (section 3).

* [** _Preamble_:** _overview of the datasets_](#0) <br>


* [**1. Flights and Airlines Data Visualization**](#1) <br>
   * [1.1 Flights Overall Situation in US](#11) <br>
   * [1.2 Flights Distribution in Airlines and Cities](#12) <br>
   * [1.3 Flights Among States](#13) <br>
   * [1.4 Monthly Changes of Flights](#14) <br>
   * [1.5 Flights Distance Over City and Airline](#15) <br>
   * [1.6 Canceled Flights Situation](#16) <br>
<br>
* [**2. Delay Analysis and Predicting**](#2) <br>
   * [2.1 Different Delays Correlation](#21) <br>
   * [2.2 Basic Statistical Description of Delays](#22) <br>
   * [2.3 Delays Heatmap](#23) <br>
   * [2.4 Delays Distribution](#24) <br>
   * [2.5 Delays Type](#25) <br>
   * [2.6 Speed Analysis](#26) <br>
   * [2.7 Build and Train Model](#27) <br>
   * [2.8 Model Evaluation](#28) <br>
   * [2.9 Factor Analysis and Suggestion](#29) <br>
<br>   
* [**3. Twitter Sentiment Analysis**](#3) <br>
   * [3.1 Sentiment Distribution](#31) <br>
   * [3.2 Word Cloud for Positive Sentiment](#32) <br>
   * [3.3 Word Cloud for Negative Sentiment](#33) <br>
   * [3.4 Word Cloud for Neutral Sentiment](#34) <br>
   * [3.5 Build and Train RNN Model](#35) <br>
<br>
* [**Conclusion**](#4) <br>   

![](http://www.slate.com/content/dam/slate/articles/technology/future_tense/2016/05/160503_FT_cybersecurity-airplanes.jpg.CROP.promo-xlarge2.jpg)
<a id="0"></a>
___
## _Preamble_: overview of the dataset

First, load all the packages needed for this project, and then read the files that contain the details of all the flights that occurred in 2015. Next, I output some information about the flights: the types of the variables in the dataframe and the number of null values for each variable.

PS: Click the button below to show the raw code

In [1]:
import numpy as np
import pandas as pd 
import lightgbm as lgb
import math
import squarify
import plotly.offline as py
py.offline.init_notebook_mode()
pd.options.display.max_columns = 50
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.metrics import classification_report
from wordcloud import WordCloud, STOPWORDS
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input,Dropout,Dense,BatchNormalization,Activation,concatenate,GRU,Embedding,Flatten
from keras.models import Model,Sequential
from keras import backend as K
from IPython.display import HTML
HTML('''
<script>
  code_show=true; 
  function code_toggle() {
   if (code_show){
     $('div.input').hide();
   } else {
     $('div.input').show();
   }
   code_show = !code_show;
  } 
  $(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
  <input type="submit" value="Click here to toggle on/off the raw code.">
</form>''')

In [None]:
airlines = pd.read_csv('../input/flight-delays/airlines.csv')
airports = pd.read_csv('../input/flight-delays/airports.csv')
flights = pd.read_csv('../input/flight-delays/flights.csv',low_memory=False)
tweet = pd.read_csv('../input/twitter-airline-sentiment/Tweets.csv', parse_dates=['tweet_created'])

In [None]:
print('Flights Data dimensions:', flights.shape)
tab_info=pd.DataFrame(flights.dtypes).T.rename(index={0:'column type'})
tab_info=pd.concat([tab_info, pd.DataFrame(flights.isnull().sum()).T.rename(index={0:'null values (nb)'})])
tab_info=pd.concat([tab_info, pd.DataFrame(flights.isnull().sum()/flights.shape[0]*100)
                         .T.rename(index={0:'null values (%)'})])
tab_info

Each entry of the `flights.csv` file corresponds to a flight, and more than 5,800,000 flights were recorded in 2015. Each flight is described by 31 variables. Here is a brief reminder of the meaning of the variables used in this notebook:

- **YEAR, MONTH, DAY, DAY_OF_WEEK**: dates of the flight <br/>
- **AIRLINE**: An identification number assigned by US DOT to identify a unique airline <br/>
- **ORIGIN_AIRPORT** and **DESTINATION_AIRPORT**: code attributed by IATA to identify the airports <br/>
- **SCHEDULED_DEPARTURE** and **SCHEDULED_ARRIVAL** : scheduled times of take-off and landing <br/> 
- **DEPARTURE_TIME** and **ARRIVAL_TIME**: real times at which take-off and landing took place <br/> 
- **DEPARTURE_DELAY** and **ARRIVAL_DELAY**: difference (in minutes) between planned and real times <br/> 
- **SECURITY_DELAY** and **AIRLINE_DELAY**: delay caused by security and by the airline <br/> 
- **LATE_AIRCRAFT_DELAY** and **WEATHER_DELAY**: delay caused by a late aircraft and by weather <br/> 
- **AIR_SYSTEM_DELAY** and **DISTANCE**: delay caused by the air system, and flight distance (in miles) <br/>
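
The three-row summary used for the flights table (column type, null count, null percentage) can also be built in one step. This is only a sketch on a tiny made-up frame; the helper name `null_summary` and the demo values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row each for dtype, null count and null percentage."""
    summary = pd.concat(
        [
            df.dtypes.rename('column type'),
            df.isnull().sum().rename('null values (nb)'),
            (df.isnull().mean() * 100).rename('null values (%)'),
        ],
        axis=1,
    )
    # Transpose so that the original columns stay as columns
    return summary.T

# Tiny made-up frame, only for illustration
demo = pd.DataFrame({'ARRIVAL_DELAY': [5.0, None, 12.0, None],
                     'AIRLINE': ['AA', 'DL', 'WN', 'UA']})
demo_info = null_summary(demo)
print(demo_info)
```

Calling `null_summary(flights)` or `null_summary(tweet)` should give the same kind of table for the real data.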

Now, let's get some information about the Twitter data.

In [None]:
print('Twitter Data dimensions:', tweet.shape)
tab_info=pd.DataFrame(tweet.dtypes).T.rename(index={0:'column type'})
tab_info=pd.concat([tab_info, pd.DataFrame(tweet.isnull().sum()).T.rename(index={0:'null values (nb)'})])
tab_info=pd.concat([tab_info, pd.DataFrame(tweet.isnull().sum()/tweet.shape[0]*100)
                         .T.rename(index={0:'null values (%)'})])
tab_info

Each entry of the `Tweets.csv` file corresponds to a tweet, and more than 14,000 tweets were recorded in 2015. Each tweet is described by 15 variables. Here is the meaning of the variables used in this notebook:

- **airline_sentiment** and **airline_sentiment_confidence**: the customer sentiment and confidence for this airline <br/>
- **negativereason** and **negativereason_confidence**: reason and confidence for this negative sentiment <br/>
- **airline** and **retweet_count**: airline name and retweet count<br/> 
- **text** and **tweet_location**: the content and location of the tweet  <br/>
- **user_timezone** and **tweet_created**: the timezone of the user and the time when this tweet was created<br/>

The `airlines.csv` file gives the airline abbreviations and the corresponding airline names. The visualizations below use the abbreviations instead of the full names, since they are much shorter.

In [None]:
airlines

<a id="1"></a>
___
## 1. Flights and Airlines Data Visualization
___
<a id="11"></a>
### 1.1 Flights Overall Situation in US

To get a global overview of the geographical area covered by this dataset, we can plot the airport locations and the flight paths between them on the world map. The size of each marker indicates the number of flights that took off from that airport.

In [None]:
flights = flights.merge(airports, left_on='ORIGIN_AIRPORT', right_on='IATA_CODE')
flights = flights.rename(columns={'AIRPORT':'ORIGIN_AIRPORT_NAME','CITY':'ORIGIN_CITY','STATE':'ORIGIN_STATE',
                                         'LATITUDE':'ORIGIN_LATITUDE','LONGITUDE':'ORIGIN_LONGITUDE'})
flights = flights.drop(['COUNTRY','IATA_CODE'],axis=1)
flights = flights.merge(airports, left_on='DESTINATION_AIRPORT', right_on='IATA_CODE')
flights = flights.rename(columns={'AIRPORT':'DESTINATION_AIRPORT_NAME','CITY':'DESTINATION_CITY',
                            'STATE':'DESTINATION_STATE','LATITUDE':'DESTINATION_LATITUDE','LONGITUDE':'DESTINATION_LONGITUDE'})
flights = flights.drop(['COUNTRY','IATA_CODE'],axis=1)
count = flights.ORIGIN_AIRPORT.value_counts().reset_index()
count.columns = ['IATA_CODE','COUNT']
airports = airports.merge(count, on='IATA_CODE')
df_path = flights[['DESTINATION_LONGITUDE', 'DESTINATION_LATITUDE','ORIGIN_LONGITUDE','ORIGIN_LATITUDE']]
df_path = df_path.drop_duplicates()
df_path = df_path.reset_index()
AIRPORT = [ dict(
        type = 'scattergeo',
        lon = airports['LONGITUDE'],
        lat = airports['LATITUDE'],
        hoverinfo = 'text',
        text = airports['AIRPORT'],
        mode = 'markers',
        marker = dict( 
            size=np.log10(airports['COUNT'])/np.log10(5)*1.2, 
            color=airports['COUNT'],
            colorscale='Viridis',
            colorbar=dict(
            thickness=10,
            titleside='right',
            outlinecolor='rgba(68,68,68,0)',
            ticks='outside',
            ticklen= 3,
            ticksuffix=' flights count',
            dtick= 30000
             ),
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]
flight_paths = []
for i in range(len(df_path)):
    flight_paths.append(dict(
            type = 'scattergeo',
            lon = [df_path['ORIGIN_LONGITUDE'][i], df_path['DESTINATION_LONGITUDE'][i]],
            lat = [df_path['ORIGIN_LATITUDE'][i], df_path['DESTINATION_LATITUDE'][i]],
            mode = 'lines',
            line = dict(
                width = 0.09,
                color = 'red',
            ),
            opacity = 0.5,
        ))
layout = dict(
        title = 'Flights Count and Distribution across US in 2015<br>(Click and drag to move and use wheel to zoom in)',
        titlefont={"size": 26},
        autosize=True,
        showlegend = False,
        geo = dict(
            resolution = 50,
            showland = True,
            showlakes = True,
            landcolor = 'rgb(204, 204, 204)',
            countrycolor = 'rgb(204, 204, 204)',
            lakecolor = 'rgb(255, 255, 255)',
            projection = dict( type="equirectangular" ),
            coastlinewidth = 2,
            lataxis = dict(
                range = [ 25, 50],
                showgrid = True,
                tickmode = "linear",
                dtick = 10
            ),
            lonaxis = dict(
                range = [-125, -69],
                showgrid = True,
                tickmode = "linear",
                dtick = 20
            ),
        )
    )
fig = dict( data=flight_paths + AIRPORT, layout=layout )
py.iplot( fig, validate=False)

#### Analysis:  Airports in big cities such as Chicago, Atlanta, San Francisco and New York had numerous departures in 2015, while the north and north-west of the US had the fewest. In general, airports on the US east coast are busier than airports on the west coast.
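
The marker size on the map comes from the expression `np.log10(count)/np.log10(5)*1.2`, which is just 1.2 times the base-5 logarithm of the departure count. A small sketch of that scaling in plain Python follows; the function name `marker_size` and the sample counts are assumptions for illustration.

```python
import math

def marker_size(flight_count: int) -> float:
    """Marker diameter used on the map: 1.2 * log base 5 of the count."""
    return math.log10(flight_count) / math.log10(5) * 1.2

# Hypothetical departure counts, from a tiny airport to a major hub
for count in (5, 25, 3_000, 300_000):
    print(count, round(marker_size(count), 2))
```

With this scaling, an airport with 100 times more departures gets a marker that is less than twice as large, which keeps the hubs from hiding the small airports.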

<a id="12"></a>
### 1.2 Flights Distribution in Airlines and Cities

The following two pie charts give the percentage of flights for each airline and city. The second pie chart only considers the top 14 cities.

In [None]:
def calculateTextpositions(values):
    total = sum(values)
    return values.apply(lambda v: 'none' if float(v)/total < 0.01 else 'auto')
COUNT_AIRLINE = flights.AIRLINE.value_counts().reset_index()
COUNT_AIRLINE.columns = ['AIRLINE_NAME','COUNT']
COUNT_CITY = flights.ORIGIN_CITY.value_counts().reset_index()
COUNT_CITY.columns = ['CITY_NAME','COUNT']
trace1 = go.Pie(labels=list(COUNT_AIRLINE['AIRLINE_NAME']),
            values=list(COUNT_AIRLINE['COUNT']),
            domain = {"x": [0, 0.47]},
            name = 'AIRLINE',
            textinfo = "label+percent",
            textfont = {"size" : 8},
            textposition =calculateTextpositions(COUNT_AIRLINE['COUNT']),
            legendgroup='group1',
            hole= .4)
trace2 = go.Pie(labels=list(COUNT_CITY.head(14)['CITY_NAME']),
            values=list(COUNT_CITY.head(14)['COUNT']),
            domain = {"x": [0.53, 1]},
            name = 'CITY',
            legendgroup='group2',
            textinfo = "label+percent",
            textfont = {"size" : 8},
            textposition =calculateTextpositions(COUNT_CITY.head(14)['COUNT']),
            hole= .4)
layout = go.Layout(
    title="Distributions for Different Flights",
    titlefont={"size": 26},
    legend=dict(orientation="h",x=-.3, y=1.2,font=dict(size=6)),
    annotations = [
            {
                "font": {
                    "size": 25
                },
                "showarrow": False,
                "text": "AIRLINE",
                "x": 0.155,
                "y": 0.5
            },
            {
                "font": {
                    "size": 25
                },
                "showarrow": False,
                "text": "CITY",
                "x": 0.808,
                "y": 0.5
            }
        ]
)
fig = go.Figure(data=[trace1,trace2], layout=layout)
py.iplot(fig)

#### Analysis:  From the first pie chart, we see that there is some disparity between the carriers. For example, Southwest Airlines accounts for about 20% of the flights, which is similar to the number of flights operated by the 7 smallest airlines combined. In the second pie chart, the differences in flight percentage among cities are less pronounced. Chicago and Atlanta are the most popular cities in the US, each with nearly 3 times as many flights as Seattle.
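
The helper `calculateTextpositions` hides the label of any slice that holds less than 1% of the total, so that tiny slices do not clutter the chart. The same rule, sketched without pandas (the name `text_positions`, the `min_share` parameter and the counts are assumptions for illustration):

```python
def text_positions(values, min_share=0.01):
    """Return 'none' for slices below min_share of the total, else 'auto'."""
    total = sum(values)
    return ['none' if v / total < min_share else 'auto' for v in values]

counts = [1200, 800, 15, 5]  # hypothetical flight counts per airline
positions = text_positions(counts)
print(positions)
```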

<a id="13"></a>
### 1.3 Flights Among States

Now the flights are grouped by state. As the following analysis shows, there are 51 distinct states in total.

In [None]:
fig = plt.figure(figsize=(25, 21))
marrimeko=flights.ORIGIN_STATE.value_counts().to_frame()
ax = fig.add_subplot(111, aspect="equal")
ax = squarify.plot(sizes=marrimeko['ORIGIN_STATE'].values,label=marrimeko.index,
              color=sns.color_palette('cubehelix_r', 28), alpha=1)
ax.set_xticks([])
ax.set_yticks([])
fig=plt.gcf()
fig.set_size_inches(40,25)
plt.title("Treemap of Flights Counts Across different States", fontsize=30)
plt.show()

#### Analysis:  CA, TX and FL have the most departures, followed by IL, GA and NY.

<a id="14"></a>
### 1.4 Monthly Changes of Flights

Let's see the change of flights from January to December.

In [None]:
count_month = flights.groupby(['MONTH', 'AIRLINE']).count().reset_index()[['MONTH','AIRLINE','YEAR']]
data = []
for name in count_month['AIRLINE'].unique():
    trace = go.Scatter(
        x = count_month[count_month['AIRLINE']==name]['MONTH'],
        y = count_month[count_month['AIRLINE']==name]['YEAR'],
        mode = 'lines+markers',
        name = name
    )
    data.append(trace)

layout = dict(title = 'Monthly Changes of Flights',
              xaxis = dict(title = 'Month'),
              yaxis = dict(title = 'Flights Count'),
              )

fig = dict(data=data, layout=layout)
py.iplot(fig)

#### Analysis:  Almost all airlines have their lowest flight counts in February, probably because of the weather (February is also the shortest month). American Airlines Inc. shows a large growth in flight counts from June to July.
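
The monthly counts can also be obtained with `groupby(...).size()`, which avoids counting through an unrelated column such as `YEAR`. The sketch below uses a tiny made-up frame and adds a flights-per-day column, which is one way (my assumption, not something done in the notebook) to check how much of the February dip comes from the shorter month.

```python
import calendar
import pandas as pd

# Tiny made-up frame, only for illustration
demo = pd.DataFrame({
    'MONTH':   [1, 1, 1, 2, 2, 2, 2],
    'AIRLINE': ['AA', 'AA', 'DL', 'AA', 'DL', 'DL', 'DL'],
})

# One row per (month, airline) with the number of flights
monthly = (demo.groupby(['MONTH', 'AIRLINE'])
               .size()
               .reset_index(name='FLIGHTS'))

# Normalize by the number of days in each month of 2015
days_in_month = monthly['MONTH'].map(lambda m: calendar.monthrange(2015, m)[1])
monthly['FLIGHTS_PER_DAY'] = monthly['FLIGHTS'] / days_in_month
print(monthly)
```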

<a id="15"></a>
### 1.5 Flights Distance Over City and Airline

Now, the next 3D plot shows the flights distance distribution.

In [None]:
from plotly.graph_objs import *
TOP_AIRLINE = np.array(flights.AIRLINE.value_counts().index[:8])
TOP_CITY = np.array(flights.ORIGIN_CITY.value_counts().index[:8])
flights_sample = flights[flights['AIRLINE'].isin(TOP_AIRLINE)]
flights_sample = flights_sample[flights_sample['ORIGIN_CITY'].isin(TOP_CITY)]
flights_sample = flights_sample[flights_sample.CANCELLED==0]
flights_sample = flights_sample.drop_duplicates(subset='DISTANCE')
trace1 = Scatter3d(
    x=flights_sample['AIRLINE'],
    y=flights_sample['ORIGIN_CITY'],
    z=flights_sample['DISTANCE'],
    text=flights_sample['FLIGHT_NUMBER'],
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 750,
        size= flights_sample['DISTANCE']*4,
        color = flights_sample['DISTANCE'],
        colorscale = 'Viridis',
        colorbar = dict(title = 'Distance<br>Length'),
        line=dict(color='rgb(140, 140, 170)')
    )
)
data=[trace1]
layout=dict(height=800, width=800, title='Flights Distance Over City and Airline')
fig=dict(data=data, layout=layout)
py.iplot(fig)

#### Analysis: Delta Air Lines Inc. has the longest flight, while United Air Lines Inc. and American Airlines Inc. have many long-distance flights too. Los Angeles and San Francisco have no very long-distance flights, since they are located on the west coast of the US.

<a id="16"></a>
### 1.6 Canceled Flights Situation

As for flights canceled in 2015, there are 4 reasons for cancellation. This bar chart shows the frequency of different reasons per airline.

In [None]:
# Work on a copy so that a column can be added without a SettingWithCopyWarning
CANCELLED_FLIGHTS = flights[flights.CANCELLED==1].copy()
AC = CANCELLED_FLIGHTS.AIRLINE.value_counts().reset_index()
AC.columns=['airline','count']
def decide_reason(x):
    if x == 'A':
        return 'Airline/Carrier'
    elif x =='B':
        return 'Weather'
    elif x=='C':
        return 'National Air System'
    else:
        return 'Security'
CANCELLED_FLIGHTS['Cancel Reason'] = CANCELLED_FLIGHTS['CANCELLATION_REASON'].apply(decide_reason)
fig = plt.figure(figsize=(15, 6))
ax = sns.countplot(data = CANCELLED_FLIGHTS, x = 'AIRLINE', hue='Cancel Reason', order=AC.airline)
ax.set_title('Canceled Flights Count For Different Airline',size=30)
ax.set_ylabel('Count',size=20)
ax.set_xlabel('Airline',size=20)
plt.legend(loc=1)
plt.show()

#### Analysis:  American Eagle Airlines Inc. has only 5.11% of the flights according to the earlier pie chart, but it has nearly as many canceled flights as Southwest Airlines, which shows that American Eagle Airlines Inc. has the highest chance of canceling its flights. Among the cancellation reasons, weather is the most common, except for Atlantic Southeast Airlines, Virgin America and Hawaiian Airlines Inc.
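
Comparing raw cancellation counts with flight shares is really a statement about cancellation rates, so it can help to compute the rate directly. A minimal sketch on made-up rows follows; the `REASONS` mapping mirrors `decide_reason`, with the code `'D'` for security being my assumption about what falls into its `else` branch.

```python
import pandas as pd

# Same labels as decide_reason; 'D' is assumed to be the security code
REASONS = {'A': 'Airline/Carrier', 'B': 'Weather',
           'C': 'National Air System', 'D': 'Security'}

# Made-up flights: 4 for MQ (2 cancelled), 8 for WN (2 cancelled)
demo = pd.DataFrame({
    'AIRLINE': ['MQ'] * 4 + ['WN'] * 8,
    'CANCELLED': [1, 1, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0],
    'CANCELLATION_REASON': ['B', 'A', None, None] + ['B', 'B'] + [None] * 6,
})

# Share of cancelled flights per airline (mean of a 0/1 column)
cancel_rate = demo.groupby('AIRLINE')['CANCELLED'].mean()

# Reason labels, counted over cancelled flights only
cancelled = demo[demo['CANCELLED'] == 1]
reason_counts = cancelled['CANCELLATION_REASON'].map(REASONS).value_counts()

print(cancel_rate)
print(reason_counts)
```

Here both airlines cancel 2 flights, but the rate is twice as high for the smaller one, which is the pattern the analysis above describes.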
<a id="2"></a>
___
## 2. Delay Analysis and Predicting
___
<a id="21"></a>
### 2.1 Different Delays Correlation

There are different kinds of delay, so I plot a heatmap to show the Pearson correlation between them.

In [None]:
flights = flights[flights.CANCELLED==0]
flights = flights.fillna(0)
different_type_delay = flights[['DEPARTURE_DELAY', 'ARRIVAL_DELAY', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY' , 'AIRLINE_DELAY', 
                               'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY']]
colormap = plt.cm.magma
plt.figure(figsize=(16,12))
plt.title('Pearson Correlation of Different Delays', y=1.05, size=30)
sns.heatmap(different_type_delay.corr(),linewidths=0.1,vmax=1.0, square=True, 
            cmap=colormap, linecolor='white', annot=True)
plt.show()

#### Analysis:  Arrival delay is strongly related to departure delay, but it depends much less on other types of delay such as security delay. Air system delay has the weakest relation with the other delays.
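
For reference, each cell of the heatmap is a Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. A plain-Python sketch of that computation, with made-up delay values (the function name and the numbers are assumptions for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

departure = [0, 10, 20, 30, 60]   # made-up delays in minutes
arrival = [5, 15, 25, 35, 65]     # departure delay shifted by 5 minutes
security = [0, 0, 1, 0, 0]        # rare and unrelated to the others

print(pearson(departure, arrival))
print(pearson(departure, security))
```

A constant shift leaves the coefficient at 1, which is why arrival and departure delay can be almost perfectly correlated even when their values differ.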
<a id="22"></a>
### 2.2 Basic Statistical Description of Delays  

Here, the aim is to classify the airlines with respect to their punctuality. For that purpose, I compute a few basic statistical parameters.

This chart only considers the arrival delay, since I think it is the most important delay.

In [None]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

global_stats = flights['ARRIVAL_DELAY'].groupby(flights['AIRLINE']).apply(get_stats).unstack()
global_stats = global_stats.sort_values('count',ascending=False)
global_stats

#### Analysis:  Spirit Air Lines and Frontier Airlines Inc. have the highest mean delay, meaning that you are likely to lose time if you choose them. However, note that the mean delay of Alaska Airlines Inc. is quite low, which suggests that its flights generally keep to the schedule.
<a id="23"></a>
### 2.3 Delays Heatmap 

Now, I consider all the flights from carriers which have high mean arrival delay. To make the delay distribution easier to read, I build a heatmap of carriers against origin cities, to see whether the registered delays are correlated with the city of origin.

In [None]:
airport_mean_delays = pd.DataFrame(pd.Series(flights['ORIGIN_AIRPORT'].unique()))
airport_mean_delays.set_index(0, drop = True, inplace = True)
identify_airport = airports.set_index('IATA_CODE')['CITY'].to_dict()
abbr_companies = airlines.set_index('IATA_CODE')['AIRLINE'].to_dict()
for carrier in abbr_companies.keys():
    fg1 = flights[flights['AIRLINE'] == carrier]
    test = fg1['ARRIVAL_DELAY'].groupby(flights['ORIGIN_AIRPORT']).apply(get_stats).unstack()
    airport_mean_delays[carrier] = test.loc[:, 'mean'] 
sns.set(context="paper")
fig = plt.figure(1, figsize=(12,12))

ax = fig.add_subplot(1,2,1)
subset = airport_mean_delays.iloc[:40,:].rename(columns = abbr_companies)
subset = subset.rename(index = identify_airport)
mask = subset.isnull()
sns.heatmap(subset, linewidths=0.05, cmap="YlGnBu", mask=mask, vmin = 0, vmax = 30)
plt.setp(ax.get_xticklabels(), fontsize=12, rotation = 88) ;
ax.yaxis.label.set_visible(False)

ax = fig.add_subplot(1,2,2)    
subset = airport_mean_delays.iloc[40:80,:].rename(columns = abbr_companies)
subset = subset.rename(index = identify_airport)
fig.text(0.5, 1.02, "Scale of Delays From Origin City", ha='center', fontsize = 20)
mask = subset.isnull()
sns.heatmap(subset, linewidths=0.05, cmap="YlGnBu", mask=mask, vmin = 0, vmax = 30)
plt.setp(ax.get_xticklabels(), fontsize=12, rotation = 88) ;
ax.yaxis.label.set_visible(False)

plt.tight_layout()

#### Analysis: The delays are highly dependent on the city. For example, Frontier Airlines Inc. has high delays in Chicago compared to other cities.
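
The airport-by-carrier table of mean delays that feeds the heatmap can also be produced in a single call with `pivot_table`. This is a sketch of an alternative on made-up rows, not the code used for the figure above.

```python
import pandas as pd

# Made-up flights, only for illustration
demo = pd.DataFrame({
    'ORIGIN_AIRPORT': ['ORD', 'ORD', 'ORD', 'ATL', 'ATL'],
    'AIRLINE':        ['F9',  'F9',  'AA',  'AA',  'DL'],
    'ARRIVAL_DELAY':  [40.0,  20.0,  10.0,  4.0,   6.0],
})

# Rows: origin airports, columns: carriers, cells: mean arrival delay.
# Airport/carrier pairs without flights stay NaN, which the heatmap masks.
mean_delays = demo.pivot_table(index='ORIGIN_AIRPORT', columns='AIRLINE',
                               values='ARRIVAL_DELAY', aggfunc='mean')
print(mean_delays)
```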
<a id="24"></a>
### 2.4 Delays Distribution

I use this strip plot to get an overall picture of the arrival delays for each airline.

In [None]:
df2 = flights.loc[:, ['AIRLINE', 'ARRIVAL_DELAY']]
df2['AIRLINE'] = df2['AIRLINE'].replace(abbr_companies)
fig = plt.figure(figsize=(14, 8))
colors = ['firebrick', 'gold', 'lightcoral', 'aquamarine', 'c', 'yellowgreen', 'grey',
          'seagreen', 'tomato', 'violet', 'wheat', 'chartreuse', 'lightskyblue', 'royalblue']
ax = sns.stripplot(y="AIRLINE", x="ARRIVAL_DELAY", size = 1.7, palette = colors,
                    data=df2, linewidth = 0.3,  jitter=True)
plt.setp(ax.get_xticklabels(), fontsize=14)
plt.setp(ax.get_yticklabels(), fontsize=14)
ax.set_xticklabels(['{:2.0f}h{:2.0f}m'.format(*[int(y) for y in divmod(x,60)])
                         for x in ax.get_xticks()])
plt.xlabel('ARRIVAL DELAY', fontsize=18, bbox={'facecolor':'midnightblue', 'pad':5},
           color='w', labelpad=20)
ax.yaxis.label.set_visible(False)
ax.set_title('Flights Delays for Different Airlines',size=30)

plt.show()
del df2 , abbr_companies

#### Analysis:  Many flights from American Airlines have a high arrival delay (even more than 18h). However, the delays of flights from Hawaiian Airlines Inc. are mainly within 6h, which is very good.

<a id="25"></a>
### 2.5 Delays Type

To see more details, I divide arrival delays into two types based on my own experience.

Arrival delays that exceed 45 minutes are labeled "Seriously Delay". 

In [None]:
flights['DELAY_LEVEL'] = flights['ARRIVAL_DELAY'].apply(lambda x:(0,1)[x > 45])
zero_list = []
one_list = []
for col in list(flights['AIRLINE'].unique()):
    zero_list.append(flights[flights['AIRLINE']==col].groupby('DELAY_LEVEL').count()['YEAR'][0])
    one_list.append(flights[flights['AIRLINE']==col].groupby('DELAY_LEVEL').count()['YEAR'][1])
trace1 = go.Bar(
    x=list(flights['AIRLINE'].unique()),
    y=zero_list ,
    name='Not Seriously Delay'
)
trace2 = go.Bar(
    x=list(flights['AIRLINE'].unique()),
    y=one_list,
    name='Seriously Delay'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title='Count of flights Based on Delay Type',
    titlefont={"size": 36}
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

#### Analysis:  Flights from Southwest Airlines Co. are the most likely to be seriously delayed. The airlines with few flights commonly have a low chance of serious delay.
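
Since the stacked bars show counts, the chance of a serious delay is easier to compare as a share per airline. The sketch below applies the same 45-minute rule to a few made-up records; the function names and the data are assumptions for illustration.

```python
from collections import defaultdict

SERIOUS_THRESHOLD_MIN = 45  # same cut-off as DELAY_LEVEL

def delay_level(arrival_delay):
    """1 if the arrival delay exceeds 45 minutes, else 0."""
    return 1 if arrival_delay > SERIOUS_THRESHOLD_MIN else 0

def serious_share(records):
    """Share of seriously delayed flights per airline."""
    totals = defaultdict(int)
    serious = defaultdict(int)
    for airline, delay in records:
        totals[airline] += 1
        serious[airline] += delay_level(delay)
    return {airline: serious[airline] / totals[airline] for airline in totals}

# Made-up (airline, arrival delay in minutes) pairs
records = [('WN', 50), ('WN', 10), ('WN', 0), ('WN', -5),
           ('HA', 60), ('HA', 3)]
shares = serious_share(records)
print(shares)
```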

<a id="26"></a>
### 2.6 Speed Analysis

We can examine the relation between arrival delay and flight speed using a violin plot.

Here, I use this equation: **v = s/t** (v = speed, s = distance and t = time)

In [None]:
flights['FLIGHT_SPEED'] = 60*flights['DISTANCE']/flights['AIR_TIME']
flights['DELAY_LEVEL_NAME'] = flights['DELAY_LEVEL'].apply(lambda x:('Not Seriously Delay','Seriously Delay')[x == 1])
fig= plt.figure(figsize=(12,9))
ax = sns.violinplot(data=flights[flights['FLIGHT_SPEED']<100000], y='AIRLINE',x='FLIGHT_SPEED',hue='DELAY_LEVEL_NAME')
ax.set_title('Flight Speed for Different Airlines',size=30)
ax.set_ylabel('Airline',size=20)
ax.set_xlabel('Flight Speed (miles/hour)',size=20)
plt.setp(ax.get_legend().get_texts(), fontsize='20')
plt.legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
plt.show()

#### Analysis:  Basically, seriously delayed flights are slower, which suggests that speed is a major factor associated with delay. The exception is Hawaiian Airlines Inc., but since its flight count is very low (only 1.31% of the total), the overall conclusion still holds.
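
The speed formula multiplies by 60 because `AIR_TIME` is in minutes while the axis is in miles per hour. After the earlier `fillna(0)`, a missing air time becomes 0 and the division gives an infinite speed, which is what the `FLIGHT_SPEED < 100000` filter removes. A minimal sketch that makes the zero case explicit (the function name is hypothetical):

```python
def flight_speed_mph(distance_miles, air_time_minutes):
    """Average speed in miles per hour, or None when air time is unusable."""
    if not air_time_minutes or air_time_minutes <= 0:
        return None  # missing or zero air time: speed is undefined
    return 60.0 * distance_miles / air_time_minutes

print(flight_speed_mph(1000, 120))  # 1000 miles in 2 hours
print(flight_speed_mph(1000, 0))    # undefined
```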

<a id="27"></a>
### 2.7 Build and Train Model

Now, I will build a model to predict the arrival delays. First, I need to choose the right columns from the original dataset, since many of them give away the answer. For example, AIR_TIME represents the whole flying time, from which the arrival delay can be obtained by subtraction. According to the correlation heatmap above, I keep the columns that contain the different delays, except the departure delay. After using grid search on my own computer to tune the parameters of this LightGBM model, I can build the model and fit it to the training data.

In [None]:
flights = pd.read_csv('../input/flight-delays/flights.csv',low_memory=False)
flights = flights[~flights['AIR_SYSTEM_DELAY'].isnull()]
target = flights.ARRIVAL_DELAY.values.copy()
y = np.log1p(target)
feature_columns = ['MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER',
       'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
       'SCHEDULED_DEPARTURE','SCHEDULED_TIME', 'DISTANCE','SCHEDULED_ARRIVAL', 'LATE_AIRCRAFT_DELAY',
       'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'WEATHER_DELAY']
flights = flights[feature_columns]
cat_col = ['AIRLINE','TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']
def encoder(col_name):
    le = LabelEncoder()
    return le.fit_transform(flights[col_name].astype(str).values)
for col in cat_col:
    flights[col] = encoder(col)
Xt, Xv, yt, yv = train_test_split(flights, y, test_size=0.2, random_state=42)
params = {
       'learning_rate': 0.75,
        'application': 'regression',
        'max_depth': 3,
        'num_leaves': 1000,
        'verbosity': -1,
        'metric': 'RMSE'
}
evals_result = {} 
d_train = lgb.Dataset(Xt, label=yt)
d_valid = lgb.Dataset(Xv, label=yv)
model = lgb.train(params, d_train, 6000, valid_sets=[d_valid], verbose_eval=1000,evals_result=evals_result)

<a id="28"></a>
### 2.8 Model Evaluation

Evaluate the performance of the regression model based on the test set.

Here, I use **Root Mean Squared Logarithmic Error** as the metric.

In [None]:
def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
y_preds = model.predict(Xv)
y_preds = np.expm1(y_preds)
y_true = np.expm1(yv)
v_rmsle = rmsle(y_true, y_preds)
print(" RMSLE error on dev test: "+str(v_rmsle))

#### Analysis:  The RMSLE is really low (<0.01), meaning that the model is very successful: it can accurately predict the arrival delay from the provided information. In other words, the model can guide efforts to decrease arrival delay.
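
RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + truth)`. A vectorized sketch with NumPy, which should give the same number as the loop-based `rmsle` but faster on large arrays (the sample delays are made up):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, computed on whole arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up delays in minutes; the prediction is off by 0.1 on the log scale
y = np.array([15.0, 30.0, 120.0])
log_pred = np.log1p(y) + 0.1
print(rmsle(y, np.expm1(log_pred)))
```

Because the model is trained on `log1p(target)`, its RMSE on the log scale and the RMSLE on the original scale are the same quantity, which the example illustrates.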

<a id="29"></a>
### 2.9 Factor Analysis and Suggestion

In order to decrease the delay, let's look at the influence of the different factors and decide on strategies.

In [None]:
ax = lgb.plot_importance(model)
plt.show()

#### Analysis: Late aircraft delay and airline delay are the most important factors. Airline companies should invest more in reducing late aircraft delay, and can ignore security delay, which is unimportant.
<a id="3"></a>
___
## 3. Twitter Sentiment Analysis
___
<a id="31"></a>
### 3.1 Sentiment Distribution

Let's look at the overall Twitter sentiment for the different airlines.

In [None]:
positive_list = []
negative_list = []
neutral_list = []
for col in list(tweet['airline'].unique()):
    positive_list.append(tweet[tweet['airline']==col].groupby('airline_sentiment').count()['tweet_id'].positive)
    negative_list.append(tweet[tweet['airline']==col].groupby('airline_sentiment').count()['tweet_id'].negative)
    neutral_list.append(tweet[tweet['airline']==col].groupby('airline_sentiment').count()['tweet_id'].neutral)
trace1 = go.Bar(
    x=list(tweet['airline'].unique()),
    y=positive_list ,
    name='Positive Sentiment'
)
trace2 = go.Bar(
    x=list(tweet['airline'].unique()),
    y=negative_list,
    name='Negative Sentiment'
)
trace3 = go.Bar(
    x=list(tweet['airline'].unique()),
    y=neutral_list,
    name='Neutral Sentiment'
)

data = [trace1, trace2, trace3]
layout = go.Layout(
    barmode='stack',
    title='Customer sentiment for different airlines',
    titlefont={"size": 36}
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

#### Analysis: United Airlines is the most popular airline, receiving the most comments, while few people choose Virgin America. Delta has the highest quality of service, judging by its percentage of positive feedback.
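
The comparison above is about percentages rather than raw counts, so here is a small sketch of how the share of each sentiment per airline could be computed. It uses plain Python on made-up rows; the function name and data are assumptions for illustration.

```python
from collections import Counter

def sentiment_shares(rows):
    """Map (airline, sentiment) to its share of that airline's tweets."""
    pair_counts = Counter(rows)
    airline_totals = Counter(airline for airline, _ in rows)
    return {(airline, sentiment): n / airline_totals[airline]
            for (airline, sentiment), n in pair_counts.items()}

# Made-up (airline, sentiment) pairs
rows = ([('United', 'negative')] * 3 + [('United', 'positive')]
        + [('Delta', 'negative'), ('Delta', 'positive')])
shares = sentiment_shares(rows)
print(shares)
```

In this toy data United gets more tweets overall, but Delta has the higher positive share, which is the kind of distinction the analysis relies on.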

<a id="32"></a>
### 3.2 Word Cloud for Positive Sentiment

What will people say when they have positive sentiment for this airline?

In [None]:
X = tweet['text']
y = tweet['airline_sentiment']
negative = y == 'negative'
positive = y == 'positive'
neutral = y == 'neutral'
stopwords = set(STOPWORDS)
def create_wordcloud(subset):
    plt.figure(figsize=(20,15))
    wc = WordCloud(background_color="white", max_words=2000, 
                   stopwords=stopwords, max_font_size= 40)
    wc.generate(" ".join(tweet[subset]['text'].str.lower()))
    plt.imshow(wc.recolor(colormap='ocean', random_state=37), alpha=0.98, interpolation="bilinear")
    plt.axis('off');
    
stopwords.update(['flight','delta','jetblue','americanair','usairway','southwestair','usairways',
                  'united','virginamerica','southwest'])
create_wordcloud(positive)

#### Analysis: Thank, great, awesome, good, appreciate and amazing are common when people have positive sentiment. Customers tend to share their good mood after enjoying a high-quality flight.

<a id="33"></a>
### 3.3 Word Cloud for Negative Sentiment

Now, let's see the common words when people think the service is bad.

In [None]:
create_wordcloud(negative)

#### Analysis: Negative tweets frequently contain words such as cancelled, flight, customer and hour. This suggests that customers tend to complain while they are waiting for delayed flights.
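
A word cloud gives a visual impression only. Counting the words directly is one way to check which terms really dominate. The sketch below uses only the standard library; `top_words` is a hypothetical helper, its tokenizer is cruder than the one inside `WordCloud`, and the sample tweets are invented rather than taken from the dataset.

```python
import re
from collections import Counter

def top_words(texts, stopwords, n=10):
    """Return the n most common lowercase words that are not stop words."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; punctuation and digits are dropped.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

# Hypothetical example tweets, not taken from the dataset.
sample = [
    "Flight cancelled again, two hours on hold",
    "Cancelled flight and no customer service",
    "Waited hours for customer service",
]
result = top_words(
    sample,
    stopwords={"flight", "and", "no", "on", "for", "two", "again"},
    n=3,
)
print(result)
```

In the notebook, the equivalent call would be `top_words(tweet[negative]['text'], stopwords)`, reusing the `negative` mask and the `stopwords` set defined earlier.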

<a id="34"></a>
### 3.4 Word Cloud for Neutral Sentiment

We can also look for what neutral tweets have in common.

In [None]:
create_wordcloud(neutral)

#### Analysis: "Thank" is still common in neutral tweets, while the other frequent words are mostly nouns such as today, tomorrow and ticket. I think people with neutral sentiment would be willing to try the airline again.

<a id="35"></a>
### 3.5 Build and Train RNN Model

I use the tweet text as the main input and other factors, such as the tweet creation time, as auxiliary inputs to the model. The following picture shows the exact structure of my RNN model.
![](https://lh3.googleusercontent.com/5wfEY_1v2gFB9ywBTHpb22dfvFR61MDW1aDhyu8l194Nl2x40KsKCAGUIgsEJQjHkhzIVb2--Q1Bwyn8Rj1CnX3K6jRc84blVWxuzwCjDUa_ofVtWV-NdKGLJP5aPiI2vWQdTMWh3yKbY0HJcDQ9W5QhvdNBVAOqM18xx8UAGrkPIkRK3DRK78mf6JpSkCPv8pLnnoL0Cq7ww69iDEKVopo3X4EdhDoAafUi1cgMZhyDsvFGmOC3-Z-5dgyyTXp0Ug0fSb7jigbhVIKtPP1fEnMIdVYY2PyHR3x67L-k3vbtpB8TVzXHqAe-VqeiwY5QKhP4LW4yg6noDCMGBY0_BKiZ47CIb7WC_2Wj4xIyDdIY8mQOmGT29SfK0ccUqn3YNpzkXi0LyGfgr1yljXq0HjyZedrzbL7DA1qd-SnmPgwP-lGfyVi1a3KhuMkaDYCdAg2-UtUdnQ8Y0F3SS-IQth4mDxsT3VQ-1wcgjWCnkBjyOd7F88I-Pb8W-SADBKIFsf8RddcN4kP3MEUIgO34lwkL3czQY8-91tdPkXraoWqqf9qAG_a2bwiQ2GbfkvaoydhfmItI1dBPRU3DM9WiompXZjfp9CPI1mGwx05I=w674-h621-no)
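
Before the text reaches the recurrent layer, every tweet has to become an integer sequence of the same length. The training cell does this with `pad_sequences(dataset.text, maxlen=36)`. The sketch below is a plain-Python illustration of what that call does with its default settings, which, as far as I know, pad and truncate at the start of the sequence. `pad_pre` is a hypothetical helper and the token ids are invented.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Hypothetical token ids; in the notebook they come from the Keras Tokenizer.
example = pad_pre([[5, 2, 9], [7, 1, 4, 8, 3, 6]], maxlen=4)
print(example)
```

Short tweets are filled with zeros on the left and long tweets lose their first tokens, so the end of each tweet always sits next to the last time step of the GRU.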

Next, the parameters are chosen by grid search and the model is trained.

In [None]:
tweet['Created_Month'] = tweet['tweet_created'].apply(lambda x: x.month)
tweet['Created_Day'] = tweet['tweet_created'].apply(lambda x: x.day)
tweet = tweet.drop(['tweet_id','airline_sentiment_gold','negativereason_gold','tweet_coord','name','tweet_created'],axis=1)
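def handle_missing(dataset):
    # Assign the result back: fillna(inplace=True) on a selected column
    # may act on a copy and leave the frame unchanged in recent pandas
    dataset['negativereason'] = dataset['negativereason'].fillna("None")
    dataset['negativereason_confidence'] = dataset['negativereason_confidence'].fillna(0)
    dataset['tweet_location'] = dataset['tweet_location'].fillna("missing")
    dataset['user_timezone'] = dataset['user_timezone'].fillna("missing")
    return dataset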
tweet = handle_missing(tweet)
def tweet_encoder(col_name):
    le = LabelEncoder()
    return le.fit_transform(tweet[col_name])
cat_columns = ['airline_sentiment', 'negativereason',
               'airline','tweet_location', 'user_timezone']
for col in cat_columns:
    tweet[col] = tweet_encoder(col)
tok_raw = Tokenizer()
tok_raw.fit_on_texts(tweet['text'])
tweet["text"] = tok_raw.texts_to_sequences(tweet.text.str.lower())
dtrain, dvalid = train_test_split(tweet, random_state=123, train_size=0.8)
dtrain_target = pd.get_dummies(dtrain.airline_sentiment).values
dvalid_target = pd.get_dummies(dvalid.airline_sentiment).values
def get_keras_data(dataset):
    X = {
        'text': pad_sequences(dataset.text, maxlen=36),
        'airline_sentiment_confidence': np.array(dataset.airline_sentiment_confidence),
        'negativereason' : np.array(dataset.negativereason),
         'negativereason_confidence' : np.array(dataset.negativereason_confidence),
         'airline' : np.array(dataset.airline),
         'retweet_count' : np.array(dataset.retweet_count),
         'tweet_location' : np.array(dataset.tweet_location),
        'user_timezone' : np.array(dataset.user_timezone),
        'Created_Month' : np.array(dataset.Created_Month),
        'Created_Day' : np.array(dataset.Created_Day),
    }
    return X
X_train = get_keras_data(dtrain)
X_valid = get_keras_data(dvalid)
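# Input sizes for the embedding layers (largest index + 1).
# tweet.text holds lists, so Series.max() compares them lexicographically
# and does not give the largest token id; use the tokenizer vocabulary instead.
MAX_TEXT = len(tok_raw.word_index) + 1
MAX_NEGATIVEREASON = tweet.negativereason.max() + 1
MAX_AIRLINE = tweet.airline.max() + 1
MAX_LOCATION = tweet.tweet_location.max() + 1
MAX_TIMEZONE = tweet.user_timezone.max() + 1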
def get_model():
    dr_r = 0.5
    
    text = Input(shape=[X_train["text"].shape[1]], name="text")
    airline_sentiment_confidence = Input(shape=[1], name="airline_sentiment_confidence")
    negativereason = Input(shape=[1], name="negativereason")
    negativereason_confidence = Input(shape=[1], name="negativereason_confidence")
    airline = Input(shape=[1], name="airline")
    retweet_count = Input(shape=[1], name="retweet_count")
    tweet_location = Input(shape=[1], name="tweet_location")
    user_timezone = Input(shape=[1], name="user_timezone")
    Created_Month = Input(shape=[1], name="Created_Month")
    Created_Day = Input(shape=[1], name="Created_Day")
    
    emb_text = Embedding(MAX_TEXT, 50)(text)
    emb_negativereason = Embedding(MAX_NEGATIVEREASON, 10)(negativereason)
    emb_airline = Embedding(MAX_AIRLINE, 10)(airline)
    emb_tweet_location = Embedding(MAX_LOCATION, 10)(tweet_location)
    emb_user_timezone = Embedding(MAX_TIMEZONE, 10)(user_timezone)

    rnn_layer = GRU(128) (emb_text)
    
    main_l = concatenate([
        Flatten() (emb_negativereason)
        , Flatten() (emb_airline)
        , Flatten() (emb_tweet_location)
        , Flatten() (emb_user_timezone)
        , rnn_layer
        , Created_Day
        , Created_Month
        , retweet_count
        , negativereason_confidence
        , airline_sentiment_confidence
    ])
    main_l = BatchNormalization() (Dropout(dr_r) (Dense(64) (main_l)))
    main_l = BatchNormalization() (Dropout(dr_r) (Dense(64) (main_l)))
    main_l = BatchNormalization() (Dropout(dr_r) (Dense(64) (main_l)))

    output = Dense(3, activation="softmax") (main_l)
    
    model = Model([text, airline_sentiment_confidence, negativereason, negativereason_confidence,
    airline, retweet_count, tweet_location, user_timezone, Created_Month, Created_Day], output)
    model.compile('sgd', 'categorical_crossentropy', metrics=['accuracy'])
    return model    

BATCH_SIZE = 16
epochs = 9

# Build the model once, right before training
model = get_model()
model.fit(X_train, dtrain_target, epochs=epochs, batch_size=BATCH_SIZE
          , validation_data=(X_valid, dvalid_target)
          , verbose=1)
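
The cell above trains with a single fixed setting (`BATCH_SIZE = 16`, `epochs = 9`, `dr_r = 0.5`); the grid search itself is not shown in the notebook. The following is a minimal sketch of how such a search could be organized, not the procedure actually used. `grid_search` and `fake_evaluate` are hypothetical names, and the scoring formula is made up so that the example runs without Keras. In practice `evaluate` would build the model, fit it, and return the validation accuracy.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination in grid and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)  # higher is better, e.g. validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in for "build the model, fit it, return validation accuracy".
# The formula is invented purely so that the example is runnable.
def fake_evaluate(batch_size, dropout):
    return 1.0 - abs(dropout - 0.5) - abs(batch_size - 16) / 100

grid = {"batch_size": [16, 32, 64], "dropout": [0.3, 0.5]}
best_params, best_score = grid_search(fake_evaluate, grid)
print(best_params, best_score)
```

An exhaustive grid grows multiplicatively with each added parameter, so with a model that takes minutes per fit it is usually worth keeping the grid small.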

#### Analysis: The accuracy is quite high (84.6%), which suggests that sentiment can be predicted from people's tweets together with the auxiliary inputs. We could build a related recommendation system for people who want to fly, helping to make sure they are satisfied with the trip and improving the service.
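
To put an accuracy figure in context, it can help to compare it with the score obtained by always predicting the most frequent class. The sketch below assumes nothing about the real class balance: `majority_baseline` is a hypothetical helper and the labels are invented.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical labels; in the notebook one could pass dvalid.airline_sentiment.
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
baseline = majority_baseline(labels)
print(baseline)
```
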
<a id="4"></a>
___
## Conclusion
___
This notebook has three parts. The first part explores the dataset, with the aim of understanding some properties of the distribution of flights across the US. This exploration gives me the occasion to use various visualization tools offered by Python. The second part builds a model aimed at predicting flight arrival delays. For that purpose, I use a LightGBM model and show the importance of different factors. The model is accurate enough that its results could guide efforts to reduce flight delays. The third part deals with tweets, and the RNN model performs well, which suggests that there are learnable patterns in customer tweets. We could build a recommendation system on top of it to improve customer feedback.