# 🚀 Project

* * * 

### Icons Used In This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💭 **Reflection**: Helping you think about programming.<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [🚀 Project](#project)

### Data: Airline Tweets
In this section, we will go through an example analysis of tweets about airlines. We will bring together the basic programming, data loading, and statistical analysis and visualization techniques from this workshop to analyze airline tweets. 

First, let's import the packages to use in this analysis:

In [None]:
import numpy as np
import pandas as pd
import os
import statsmodels.api as sm

## 1. Getting the data

Before we can get our data, you should know something more about **filepaths**. 

A filepath is the location of a file on your system. There are two kinds of filepaths:

* **absolute**: The filepath from the top level directory (or folder).
    * For Macs, these begin with a forward slash, followed by folders separated by a **forward slash**. E.g. `/Users/[USERNAME]/directory/subdirectory/file`.
    * For Windows, these begin with a backward slash or, more commonly, a volume, e.g. `C:\Documents\directory\subdirectory\file`. Note the **backward slash** to separate folders.
* **relative**: The filepath relative to the current working directory (i.e. notebook location). Common locations include:
    * File in same folder: `./file` or `file` (`.` means 'here').
    * Subfolder: `subfolder/file`.
    * Higher folder: `../sisterfolder/file` (`..` means 'go up one level in the directory').
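
As a quick illustration of the difference, the sketch below uses the standard-library `os.path` module to check whether a path is absolute and to resolve a relative path against the current working directory. The path strings are made up for the example and do not need to exist on your system.

```python
import os

# Hypothetical relative path, built with the separator of the current system
relative_path = os.path.join('subfolder', 'file.csv')

# A relative path does not start from the top-level directory
print(os.path.isabs(relative_path))

# abspath() resolves a relative path against the current working directory
absolute_path = os.path.abspath(relative_path)
print(absolute_path)
print(os.path.isabs(absolute_path))

# normpath() collapses '..' segments where it can
tidy_path = os.path.normpath(os.path.join('data', '..', 'notebooks', 'file.csv'))
print(tidy_path)
```

Building paths with `os.path.join()` rather than typing slashes by hand is one way to keep the same code working on both Mac and Windows.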

When you are figuring out what filepath to use, you can use `os.listdir([PATH])` to list all files and subdirectories in a path. For example, let's see what items are available to us in the current folder (noted with a dot `.`).

🔔 **Question**: In this current folder we're checking out, which items are folders and which are files? (**Hint:** You can double check by looking at the files in JupyterLab/ Jupyter Notebook).

In [None]:
import os
os.listdir('.')

Looking up the items in the folder after moving up one level works like this:

In [None]:
os.listdir('../')
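
`os.listdir()` returns names only, so it does not say which items are folders. One possible way to check programmatically is `os.path.isdir()`. The sketch below builds a small temporary folder so that it runs anywhere; the helper `label_items()` and the item names are hypothetical.

```python
import os
import tempfile

def label_items(path):
    """Return a dict mapping each item in `path` to 'folder' or 'file'."""
    labels = {}
    for item in sorted(os.listdir(path)):
        # listdir() gives bare names, so join them onto the folder first
        full_path = os.path.join(path, item)
        labels[item] = 'folder' if os.path.isdir(full_path) else 'file'
    return labels

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'data'))               # a subfolder
    with open(os.path.join(tmp, 'notes.txt'), 'w') as f:
        f.write('hello')                              # a file
    labels = label_items(tmp)

print(labels)  # {'data': 'folder', 'notes.txt': 'file'}
```

Calling `label_items('.')` in the notebook would label the items in the current folder in the same way.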

### 1.1 Our Dataset: Airline Tweets

The dataset we will use is from the [Airline tweets sentiment dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?resource=download), which contains tweets that tag one of several major airlines. The dataset also includes information about the tweet location and time, the airline mentioned, and the sentiment of the tweet.

There are several files in the "airline_data" subfolder of the "data" folder in our Python Fundamentals repository. 

🔔 **Question**: Use the File Browser to the left of your screen. Can you find the "airline_data" folder?

### 1.2 Import Data

Use `os.listdir()` to see the files in the "airline_data" folder. 

💡 **Tip**: Remember how to move up in the folder structure? `../../` goes up two folders!

In [None]:
# YOUR CODE HERE




### 1.3 Load in a Single File

First, let's load in a single file and take a look at it. 

1. Read in the `Delta.csv` file as a `pandas` DataFrame.
2. How many rows are there? How many columns?
3. Which columns seem most informative? Are there any extra or redundant columns? 
4. Where is the airline represented in the csv file?

In [None]:
import pandas as pd

# Load in file for Delta
single_airline = pd.read_csv(___)

It turns out that the airline is not recorded in any column, but it does appear in the name of the csv file. Let's extract that information and add it to the DataFrame in a column called `airline`. 

In [None]:
# Example: extracting filename from path
filename = 'airline_data/Delta.csv'.split('/')[1] 
print(filename)

# Splitting the filename again to get rid of '.csv'
filename_short = filename.split('.')[0] 
print(filename_short)

# Add the airline name to the DataFrame as a new column
single_airline['airline'] = filename_short
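
The same extraction can also be sketched with `os.path`, which does not depend on how many folders come before the filename. This is an alternative to the `split()` approach above, not the workshop's own method.

```python
import os

filepath = '../../data/airline_data/Delta.csv'

# basename() keeps only the final component of the path
filename = os.path.basename(filepath)
print(filename)       # Delta.csv

# splitext() separates the extension from the rest of the name
airline_name, extension = os.path.splitext(filename)
print(airline_name)   # Delta
print(extension)      # .csv
```

Because `splitext()` only splits at the last dot, it also behaves sensibly for names that contain more than one dot.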

Let's create a function for this. We'll call it `process_file()`. It takes a filepath in an argument called `filepath` and returns the DataFrame with the `airline` column added. 

Use the example above to fill in the `filename` and `filename_short` variables by splitting the filepath.

In [None]:
def process_file(filepath):
    df = pd.read_csv(filepath)
    # Add code to extract the airline name from the filepath
    filename = ___
    filename_short = ___
    df['airline'] = filename_short
    return df

🔔 **Question**: Here's another filepath: `'../../data/airline_data/US-Airways.csv'`. What will be in the `airline` column in the output from the function call below?

In [None]:
process_file('../../data/airline_data/US-Airways.csv')

### 1.4 Load in Multiple Files

Now that we have a function, let's iterate through all of the files in the directory. 

First, fill in the blanks below. Use `os.listdir` to loop through and print every file in the `airline_data` directory.

In [None]:
directory = '../../data/airline_data'
for file in os.____(____):
    print(______)


We notice that there is a `.txt` file in the directory, which isn't a csv file that we can load like the others. This will cause an error in the DataFrame processing, so let's use an `if` statement to filter out the `.txt` file. 

Before implementing this, let's practice. Here's a test filename. Write an expression that slices the last 3 characters of the `test_csv` variable and uses the equality operator (`==`) so that it evaluates to `True`.

💡 **Tip**: Recall slicing the last elements of a list. For instance, use `some_list[-2:]` to get the last two items.
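
Negative slices work on strings just as they do on lists. A minimal sketch with a made-up filename (`test_txt` is hypothetical and deliberately not the exercise variable):

```python
test_txt = 'notes.txt'  # hypothetical filename

# Negative indices count from the end, so [-3:] keeps the last 3 characters
last_three = test_txt[-3:]
print(last_three)            # txt
print(last_three == 'txt')   # True
print(last_three == 'csv')   # False

# str.endswith() expresses the same kind of check without counting characters
print(test_txt.endswith('.txt'))  # True
```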

In [None]:
test_csv = 'delta.csv' # Expression should evaluate True

# YOUR CODE HERE




Now that we have an expression, let's create a for-loop to check if it works over the files in our folder. 

In [None]:
directory = '../../data/airline_data'
for file in os.listdir(directory):
    if _____:  # Fill in the blank to filter for files ending with `.csv`
        print(file)

## 🥊 Challenge: Put it all together

We've got most of the pieces. Now let's put the puzzle together:
1) In the cell below, paste the `process_file()` function we created above.
2) Initialize an *accumulator* list called `df_list`.
3) Paste the `for`-loop we just created to loop over the csv files in the right folder. But in the final line, instead of `print`ing the file, `append` the output of `process_file()` to our `df_list` list.

⚠️ **Warning**: Note that when calling `process_file()` in the for-loop, you'll need to pass the **full filepath**, not just the file name! 
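
The accumulator pattern and the full-filepath issue can be sketched on made-up data. The example below writes two tiny csv files to a temporary folder, so the file names and contents are assumptions for illustration, and it calls `pd.read_csv()` directly rather than `process_file()`.

```python
import os
import tempfile

import pandas as pd

df_list = []  # accumulator: starts empty and grows inside the loop

with tempfile.TemporaryDirectory() as directory:
    # Write two tiny made-up csv files so the example is self-contained
    pd.DataFrame({'text': ['a', 'b']}).to_csv(
        os.path.join(directory, 'AirOne.csv'), index=False)
    pd.DataFrame({'text': ['c']}).to_csv(
        os.path.join(directory, 'AirTwo.csv'), index=False)

    for file in sorted(os.listdir(directory)):
        # os.listdir() returns bare names, so join them onto the directory
        full_path = os.path.join(directory, file)
        df_list.append(pd.read_csv(full_path))

print(len(df_list))                        # 2
print([len(frame) for frame in df_list])   # [2, 1]
```

Passing only `file` to `pd.read_csv()` would look for the file in the current working directory, which is why the join step matters.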

In [None]:
# YOUR CODE HERE





Finally, look up the [documentation for Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html), and see if you can find a function that **concatenates** the list of DataFrames we have now. We'll save the concatenated DataFrame in a variable called `df`.

In [None]:
# YOUR CODE HERE





🔔 **Question**: Let's take a look at the final data frame.

1. How many rows and columns are there in the total dataframe?
2. How many unique airlines are in the dataset?
3. How many numeric columns are there in the dataset?
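
Questions like these can usually be answered with a few DataFrame attributes and methods. A minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical; only the column names echo the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'airline': ['AirOne', 'AirOne', 'AirTwo'],
    'text': ['late again', 'great crew', 'lost bag'],
    'retweet_count': [0, 2, 1],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# .nunique() counts distinct values in a column
n_airlines = toy['airline'].nunique()
print(n_airlines)

# select_dtypes() keeps only the columns of the requested type
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```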

## 2. Data Processing

Now that we have some data, let's take a look at some data processing steps.

### 2.1 Nulls

First, let's summarize the null values in the dataset. We want to see which columns have null values and how many. 

Recall that `.isnull()` is a method that returns `True` where a DataFrame has null values and `False` otherwise. 

You look up finding the sum of null values in Pandas and find a suggestion to use `df.isnull().sum(axis=1)`. You try this out on your dataset and get the output below. Is this the expected output? If not, how can you modify the code to find the number of null values in each column?


In [None]:
df.isnull().sum(axis=1)
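
To see what the `axis` argument changes, here is a small sketch on a made-up DataFrame (`toy` is hypothetical). With `axis=1` the sum runs across the columns and gives one number per row; with `axis=0` it runs down the rows and gives one number per column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 'x'],
})

# axis=1: one count per ROW, indexed by row label
per_row = toy.isnull().sum(axis=1)
print(per_row.tolist())    # [1, 2, 0]

# axis=0 (the default): one count per COLUMN, indexed by column name
per_col = toy.isnull().sum(axis=0)
print(per_col.to_dict())   # {'a': 1, 'b': 2}
```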

Although there are null values in the dataset, we won't be using any of the columns with null values in the analysis, so we don't need to drop any rows from this dataset. 

Let's drop the following columns:

* `Unnamed: 0`
* `tweet_coord`
* `tweet_id`
* `user_timezone`
* `tweet_created`
* `tweet_location`
* `negativereason_gold`
* `airline_sentiment_gold`

This will make the dataset more manageable for further analysis.

In [None]:
columns_to_drop = [
    'Unnamed: 0',
    'tweet_coord',
    'tweet_id',
    'user_timezone',
    'tweet_created',
    'tweet_location',
    'negativereason_gold',
    'airline_sentiment_gold']
list(df)

# YOUR CODE HERE
df.____

### 2.2 Feature Extraction

Now let's do some basic preprocessing on the data. First, let's look at the first few rows of the dataframe. 

In [None]:
df.head(3)

Let's do some simple feature extraction on the text data. Let's make three new columns:
1. `word_count`: number of words in each tweet (💡 Tip: use `.split()` and `.len()`).
2. `mentions`: number of '@' symbols in each tweet (💡 Tip: use `.count()`).
3. A third feature of your choice.

💡 **Tip**: Remember that you can use `Series.str` to access vectorized string functions!
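
As a rough illustration of the `.str` accessor, the sketch below applies a few vectorized string methods to a made-up Series. The tweets are invented, and counting hashtags is just an example feature.

```python
import pandas as pd

tweets = pd.Series(['Great flight #happy', 'Delayed again #late #tired'])

# .str.len() on strings counts characters
char_count = tweets.str.len()
print(char_count.tolist())   # [19, 26]

# .str.split() turns each string into a list; .str.len() then counts list items
word_count = tweets.str.split().str.len()
print(word_count.tolist())   # [3, 4]

# .str.count() counts occurrences of a pattern in each string
hashtags = tweets.str.count('#')
print(hashtags.tolist())     # [1, 2]
```

Each method returns a new Series with one value per row, so the result can be assigned straight to a new column.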

In [None]:
# YOUR CODE HERE
df['word_count'] = ___
df['mentions'] = ___

# Third feature of your choice

Next steps in text preprocessing would often use tokenization or vectorization on tweets, to convert the words themselves to numerical data for analysis. If you are interested, check out the Python Text Analysis workshop! 


### 2.3 Subset Tweets

🔔 **Question**: How many sentiment types are there in the DataFrame? 

For our exploratory analysis, let's start by looking just at positive/negative tweets.

1. Subset the dataframe.
2. What proportion of the tweets have a positive sentiment?

What is the condition that we would use to subset the dataframe? Subset the dataframe for non-neutral tweets and save it to a dataframe called `pos_neg_df`.

💡 **Tip**: You can use `!=` to check for all values not equal to a certain value.
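
Subsetting with a condition works by building a boolean mask first. A minimal sketch on a made-up DataFrame (`toy`, the airline names, and the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'airline': ['AirOne', 'AirTwo', 'AirOne', 'AirTwo'],
    'retweet_count': [0, 3, 1, 0],
})

# The comparison gives one True/False per row
mask = toy['airline'] != 'AirTwo'
print(mask.tolist())   # [True, False, True, False]

# Indexing with the mask keeps the True rows; .copy() makes an independent DataFrame
subset = toy[mask].copy()
print(len(subset))     # 2

# The mean of a boolean Series is the proportion of True values
share_retweeted = (toy['retweet_count'] > 0).mean()
print(share_retweeted)  # 0.5
```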

In [None]:
# YOUR CODE HERE




### 2.4 Convert column to int

If we want to do exploratory analysis with the `airline_sentiment` column, we will need to convert the categorical values (currently in string format) into numbers. We can use `.replace()` to do so. Look up the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) and fill in the arguments needed for `.replace()` below.

In [None]:
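# Assign the result back to the column; in-place replacement on a selected column may not update the DataFrame
pos_neg_df['airline_sentiment'] = pos_neg_df['airline_sentiment'].replace(___, ___)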
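
One common way to call `.replace()` is with a dictionary that maps old values to new ones. The sketch below uses a made-up Series of size labels (`sizes` is hypothetical); the final `.astype(int)` is an extra step added here so the result is explicitly an integer column.

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'small'], dtype=object)

# Keys are the values to look for, values are their replacements
size_codes = sizes.replace({'small': 0, 'large': 1}).astype(int)
print(size_codes.tolist())   # [0, 1, 0]
```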

## 3. Exploratory Analysis


Now that we've done some very basic processing on the `DataFrame`, let's do some exploratory analyses on the data. 

###  3.1 Most Common Users, Most Frequent Airlines

Let's look at the users tweeting at the airlines. Using the `DataFrame`, answer the following questions:

1. How many unique users are there in the dataset? 
2. Who tweeted the most about airlines in this dataset? (**Hint**: consider `df.value_counts()`)
3. Choose one of the five users with the most tweets. Which airline are they tweeting about?

💡 **Tip**: Users are recorded in the `name` column.
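
As a sketch of how counting can work, the example below uses a made-up DataFrame with a `name` and an `airline` column. The user names, airline names, and rows are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['ana', 'ben', 'ana', 'cy', 'ana', 'ben'],
    'airline': ['AirOne', 'AirTwo', 'AirOne', 'AirOne', 'AirTwo', 'AirTwo'],
})

# Number of distinct users
n_users = toy['name'].nunique()
print(n_users)      # 3

# value_counts() sorts from most to least frequent
counts = toy['name'].value_counts()
top_user = counts.index[0]
print(top_user)     # ana

# Filter to one user's rows, then count the airlines they mention
top_user_airlines = toy.loc[toy['name'] == top_user, 'airline'].value_counts()
print(top_user_airlines.to_dict())   # {'AirOne': 2, 'AirTwo': 1}
```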

In [None]:
# YOUR CODE HERE



### 3.2 Visualization

Now, let's visualize some component of the dataset. Use a **histogram** to visualize the `word_count` column. Consider plotting two layers: one for negative tweets and one for positive tweets. 
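
Layering histograms usually means drawing each group on the same axes with some transparency. The sketch below does this with Matplotlib directly on made-up data (`toy`, `group`, and `value` are hypothetical); the pandas `.plot(kind='hist')` method is a wrapper around the same idea.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [3, 5, 5, 8, 9, 12],
})

fig, ax = plt.subplots()
for label, subset in toy.groupby('group'):
    # alpha < 1 makes each layer see-through so overlaps stay visible
    ax.hist(subset['value'], bins=5, alpha=0.5, label=label)

ax.set_xlabel('value')
ax.set_ylabel('count')
ax.legend()
```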


In [None]:
# YOUR CODE HERE




### 3.3 Linear Regression of Tweet Length

Let's use a linear regression to look at predictors of tweet length. Complete the steps:

1. Select the numeric columns `airline_sentiment`, `airline_sentiment_confidence`, `retweet_count`, and `mentions`, and save them as `X` (leave out `word_count`, which is the outcome).
2. Select the word_count column and save as `y`
3. Set up a linear regression and fit it to the data using `sm.OLS()`
4. Interpret the model summary

In [None]:
# YOUR CODE HERE
X = ___
y = ___
model = ___

model.summary()

## Writing Files

Finally, a `pd.DataFrame` can be exported to a `.csv` (or other filetype) using `df.to_csv()`. This method is built into every DataFrame.

🔔 **Question**:  Where does `airlines_sentiment.csv` get saved? What if you wanted to save it to the "data" directory?

In [None]:
pos_neg_df.to_csv('airlines_sentiment.csv') 
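
To choose where the file goes, pass a path instead of a bare filename. The sketch below writes a made-up DataFrame to a temporary folder and reads it back; `toy` and `toy.csv` are hypothetical names.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'airline': ['AirOne', 'AirTwo'], 'word_count': [12, 7]})

with tempfile.TemporaryDirectory() as out_dir:
    # Join the target folder and the filename to build the output path
    out_path = os.path.join(out_dir, 'toy.csv')
    toy.to_csv(out_path, index=False)   # index=False skips the row-label column
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())   # ['airline', 'word_count']
```

Without `index=False`, the row labels are written as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back in.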

## 🥊 Take-home challenge: Are Negative Tweets Longer than Positive Tweets?

For those who want some more practice after the workshop, here's a challenge.

Take a look at the negative and positive tweets. We are interested in whether negative tweets are longer or shorter than positive tweets. Let's test this with a t-test.

1. Subset the data into positive and negative tweets.
2. Select the `word_count` column.
3. Calculate the mean word count for each subset. Which mean is higher?
4. Use a t-test to compare the two sets of values from (2). What is the p-value of the result? 
5. Plot a histogram layer for both positive and negative tweet word counts. What do you notice about the distributions?

💡 **Tip**: Refer to the [first section](#stats) of this notebook for an example!

In [None]:
# Subset dataframe

# Calculate mean


# Run t-test
res = sm.stats.ttest_ind(___)


# Plot (kind = 'hist')


# 🎉 Well done!

**This concludes Python Fundamentals II!**

Today's project took us through importing multiple csv files, data manipulation, and some basic visualizations and analysis of data. 

If you were working on this dataset, what would you potentially do next? It could be either an analysis, a new feature to include, a visualization that might help represent the data, etc.

### 💡 Tip: More workshops!

D-Lab teaches workshops that allow you to practice more with DataFrames and visualization.

- To learn more about data wrangling, check out D-Lab's [Python Data Wrangling workshop](https://github.com/dlab-berkeley/Python-Data-Wrangling).
- To learn more about data visualization, check out D-Lab's [Python Data Visualization workshop](https://github.com/dlab-berkeley/Python-Data-Visualization).