# Exercises for Pandas/Seaborn

The dataset we'll be using contains news headlines along with each article's source, sentiment scores, and number of shares on social media. You can [download it from the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00432/Data/News_Final.csv). You *can* load it directly by pointing `pd.read_csv()` at the URL, but it's a large file, so it's better to download it once and read the local copy.

### Load Libraries

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

___
## Day 1: Intro to Pandas

### Introductory Bits

* Load the dataset into a variable called `news`, and view the first 3 rows.
* What is the pandas datatype of each column? For any column with dtype `object`, can you state what the underlying Python data type is?

```python
news = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/HARP-210-NLP/main/pandas/news.csv")
```
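Since the file is large, one way to avoid downloading it on every run is to keep a local copy. The sketch below is only one possible approach and is not part of the exercise; `load_cached_csv` and the local filename in the usage comment are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd


def load_cached_csv(url, local_path):
    """Read local_path if it exists; otherwise download from url and save a copy."""
    local_path = Path(local_path)
    if local_path.exists():
        # A cached copy is already on disk, so no network access is needed.
        return pd.read_csv(local_path)
    df = pd.read_csv(url)               # the download happens only on the first call
    df.to_csv(local_path, index=False)  # cache the data for later runs
    return df


# Example usage (hypothetical filename; downloads the first time it runs):
# news = load_cached_csv("<dataset URL>", "news_local.csv")
```

Once loaded, `news.head(3)` and `news.dtypes` are the usual starting points for the two prompts above.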

___
### Selecting Rows and Columns

Select the GooglePlus column.

Select both the Headline and the GooglePlus columns.

Select the Title and Headline of every article where the topic is "obama".

Select the Title and number of Facebook likes for the articles which received at least 20,000 Facebook likes.
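The selection patterns these prompts rely on can be sketched on a tiny made-up frame. Everything in `toy` is invented for illustration; the column names are assumed to mirror the real dataset.

```python
import pandas as pd

# Made-up rows; only the column names are meant to resemble the news data.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Topic": ["obama", "economy", "obama", "microsoft"],
    "Facebook": [25000, -1, 120, 40000],
})

one_col = toy["Facebook"]               # a single column comes back as a Series
two_cols = toy[["Title", "Facebook"]]   # a list of names gives a DataFrame

# A boolean mask picks rows; .loc combines a row mask with a column list.
obama_rows = toy.loc[toy["Topic"] == "obama", ["Title", "Topic"]]
popular = toy.loc[toy["Facebook"] >= 20_000, ["Title", "Facebook"]]
```

Using `.loc[mask, columns]` keeps the row filter and the column choice in one step, which tends to be easier to read than chaining two sets of brackets.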

I'm trying to select the Source and Topics columns. What am I doing wrong?

```python
news["Source", "Topics"]
```

I'm trying to find all of the articles written by USA Today. What am I doing wrong?

```python
news[ news["Source"] == "USA TODAY" ]
```

___
### Summary Functions

I want to calculate the maximum number of Facebook likes of all articles. What am I doing wrong?

```python
news.max("Facebook")
```

There are four topics in the dataset: Obama, the economy, Palestine, and Microsoft. How many articles are there for each topic?

What are the maximum and minimum number of likes on Facebook across all these articles?

What is the average number of likes on Facebook across all these articles?

You should have seen that some articles have -1 likes. This means they weren't shared at all. What is the average number of likes on Facebook across all articles, if we remove all of the articles with -1 likes?

What's the average number of LinkedIn "likes" for articles with the topic "palestine"? Make sure that you only include articles with at least 0 likes, as above.

What are the 3 most common sources in the dataset?
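As a rough illustration of the summary functions involved, here is a sketch on invented data. The values in `toy` are made up, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented rows; -1 marks an article with no recorded likes, as in the text.
toy = pd.DataFrame({
    "Source": ["Reuters", "Reuters", "BBC", "CNN", "BBC", "Reuters"],
    "Topic": ["obama", "economy", "obama", "palestine", "microsoft", "obama"],
    "Facebook": [10, -1, 30, -1, 20, 0],
})

per_topic = toy["Topic"].value_counts()           # rows per category, most common first
fb_max = toy["Facebook"].max()
fb_min = toy["Facebook"].min()
mean_all = toy["Facebook"].mean()                 # includes the -1 placeholder rows

shared = toy[toy["Facebook"] >= 0]                # filter first ...
mean_shared = shared["Facebook"].mean()           # ... then summarize

top_sources = toy["Source"].value_counts().head(3)
```

The order matters: filtering happens on the DataFrame or Series, and the summary function is applied last, because a summary such as `.mean()` returns a single number that can no longer be filtered.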

I want to calculate the average title sentiment for all Reuters articles. What am I doing wrong?

```python
news["SentimentTitle"].mean().loc[ news["Source"] == "Reuters" ]
```

What is the title of the article with the most LinkedIn likes?
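One possible way to look up the row that holds a maximum is `idxmax()`, which returns the index label of the largest value. The frame below is invented for illustration.

```python
import pandas as pd

# Made-up data; column names are assumed to match the news dataset.
toy = pd.DataFrame({
    "Title": ["first", "second", "third"],
    "LinkedIn": [3, 50, 7],
})

best_label = toy["LinkedIn"].idxmax()      # index label of the largest value
best_title = toy.loc[best_label, "Title"]  # look up another column in that row
```

If several rows tie for the maximum, `idxmax()` returns only the first of them.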

___
## Day 2: Data Manipulation

### Grouping and Summarizing

What is the average number of Facebook likes for each of the four main topics in the dataset?

Make an indicator column called `Shared` that records whether or not a story was shared on Facebook (a story was shared if its number of likes is 0 or higher).

Using only the articles which were shared, what was the average number of likes for each topic?

Make a new column called `SocialMedia` which is equal to the 3 social media "likes" added together.

**Fun Problem.** Make a new column called `MaxMedia` which is equal to the *highest* of the Facebook, Google+, and LinkedIn number of likes. You'll have to look at the [`.max()` function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html) for this.
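The grouping and column-creation steps can be sketched on a small invented frame. The data is made up, and the Fun Problem is deliberately not covered here.

```python
import pandas as pd

# Invented rows; -1 marks "not shared", as described earlier.
toy = pd.DataFrame({
    "Topic": ["obama", "obama", "economy", "economy"],
    "Facebook": [10, -1, 4, 6],
    "GooglePlus": [1, -1, 9, 2],
    "LinkedIn": [0, -1, 3, 7],
})

# Split rows by topic, then average one column within each group.
mean_by_topic = toy.groupby("Topic")["Facebook"].mean()

# A comparison produces a boolean Series, which works as an indicator column.
toy["Shared"] = toy["Facebook"] >= 0

# Filter on the indicator, then group and summarize as before.
shared_mean_by_topic = toy[toy["Shared"]].groupby("Topic")["Facebook"].mean()

# Arithmetic between columns is applied row by row.
toy["SocialMedia"] = toy["Facebook"] + toy["GooglePlus"] + toy["LinkedIn"]
```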

___
### String Manipulation Functions in Pandas

Convert all of the titles to be entirely lowercase.

What is the average length of a title, in characters? How long in words (roughly)?

What is the title of the article with the longest (by characters) title?

Create an indicator column showing whether or not Obama is mentioned in the title of an article. How frequently is he name-dropped in articles about him? In articles about Palestine?

Remove "Finland" from all of the article titles (i.e., replace the country name with the empty string, `""`).

**Fun Problem.** Make a new column, called `firstword`, which is the first word in each title.
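The `.str` accessor covers most of these prompts. Below is a hedged sketch on three invented titles; the Fun Problem is left out on purpose.

```python
import pandas as pd

# Three made-up titles for illustration.
toy = pd.DataFrame({
    "Title": ["Obama Visits Finland", "Economy grows", "Talks in Finland stall"],
})

lower = toy["Title"].str.lower()                # lowercase copy of each title
n_chars = toy["Title"].str.len()                # length in characters
n_words = toy["Title"].str.split().str.len()    # rough word count (split on whitespace)

# case=False makes the match case-insensitive; the result is a boolean Series.
mentions_obama = toy["Title"].str.contains("obama", case=False)
share_mentioning = mentions_obama.mean()        # fraction of True values

# regex=False treats the pattern as plain text rather than a regular expression.
no_finland = toy["Title"].str.replace("Finland", "", regex=False)
```

Note that removing a word this way leaves the surrounding spaces in place, so a follow-up cleanup step may be wanted.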

___
## Day 3: Data Visualization with Seaborn

___
### Histogram

* Make a histogram of the title sentiment values.
* Set the bin width to 0.1.
* Add descriptive axis labels.
* I've often found that the standard sentiment-labelling techniques just sort of give you random values, with most sentiment values very close to 0. Does this seem to be the case for this dataset?

___
### Scatterplot

* Create a `Length` column in your dataset that is the length of a title, in characters.
* Plot the length of a title on the x axis and the number of likes on Google+ on the y.
* Remove the outliers from the plot by adjusting the y limits to be from 0 to 250.
* Add labels to the axes.
* What trends do you notice in the popularity of posts?
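The scatterplot steps might look like the sketch below. The titles and like counts are made up, with one deliberately extreme value to show what the y limits do.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Invented titles and like counts; 5000 plays the role of an outlier.
toy = pd.DataFrame({
    "Title": ["Short", "A medium title", "A considerably longer headline", "Mid"],
    "GooglePlus": [12, 5000, 40, 3],
})
toy["Length"] = toy["Title"].str.len()   # title length in characters

fig, ax = plt.subplots()
sns.scatterplot(data=toy, x="Length", y="GooglePlus", ax=ax)
ax.set_ylim(0, 250)                      # points above 250 fall outside the view
ax.set_xlabel("Title length (characters)")
ax.set_ylabel("Google+ likes")
```

Setting the limits only changes what is drawn; the outlying rows are still in the DataFrame.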

___
### Countplot

* Make a countplot showing how many articles are in the dataset from each of the sources.
* Rotate the x axis labels by 90 degrees so that they're readable.
* Add descriptive axis labels.
* Add a hue to the plot so we can see how often each source reports on each topic.
* Figure out how to stack the plots instead of dodging by [reading the documentation](https://seaborn.pydata.org/generated/seaborn.countplot.html).
* Comment on the results. Why do the different sources report on different topics?
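The first few countplot steps can be sketched on invented data as follows. The sources and topics in `toy` are made up, and the stacking step is left to the documentation as the exercise asks.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up source/topic pairs for illustration.
toy = pd.DataFrame({
    "Source": ["Reuters", "Reuters", "BBC", "CNN", "BBC", "Reuters"],
    "Topic": ["obama", "economy", "obama", "palestine", "microsoft", "obama"],
})

fig, ax = plt.subplots()
sns.countplot(data=toy, x="Source", hue="Topic", ax=ax)  # one bar per source and topic
ax.tick_params(axis="x", labelrotation=90)               # rotate the x tick labels
ax.set_xlabel("News source")
ax.set_ylabel("Number of articles")
```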

___
### Boxplot

* Make a boxplot showing the number of likes each topic gets on Facebook.
* Change the y scale to "log" using `plt.yscale("log")`. Make sure you can read the axis.
* Use the `order=` keyword argument to put the plots in order based on the median number of likes.
* Add descriptive labels on the axes.
* Comment on the result. Do different topics typically have different popularities?
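One way to build the `order=` argument is to compute the per-topic medians with `groupby` and sort them. The sketch below uses invented, strictly positive like counts, since a log scale cannot display zero or negative values.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up like counts, all positive so the log scale is well defined.
toy = pd.DataFrame({
    "Topic": ["obama"] * 3 + ["economy"] * 3 + ["microsoft"] * 3,
    "Facebook": [100, 200, 300, 5, 10, 20, 50, 60, 70],
})

# Topics sorted from lowest to highest median number of likes.
order = toy.groupby("Topic")["Facebook"].median().sort_values().index.tolist()

fig, ax = plt.subplots()
sns.boxplot(data=toy, x="Topic", y="Facebook", order=order, ax=ax)
ax.set_yscale("log")   # equivalent to plt.yscale("log") on the current axes
ax.set_xlabel("Topic")
ax.set_ylabel("Facebook likes (log scale)")
```

With the real data, rows with -1 or 0 likes would need to be filtered out before plotting on a log axis.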

___
### Additional Plots

Load the Netflix dataset we worked through in class, produce at least 3 more data visualizations, and comment on each one. What information should a viewer get from them?