# Exploratory Data Analysis

Understanding our data is a crucial step in any data science project. In this assignment, you will perform an exploratory data analysis of a corpus of documents. The corpus consists of 11,587 documents, each of which is a news article. The documents are stored in a folder called `data` in the root of the repository, and the data is stored in the shared Google Drive `datasets`.

Your analysis of the news corpus should answer the following questions:

0. What is the nature of our data?
    - 0a. What is the size of the corpus?
    - 0b. Are there any duplicates in the corpus? If so, drop them.
    - 0c. Are there any missing values in the corpus?
    - 0d. How many unique documents are there in the corpus?
1. What is the distribution of `token`s per document?
    - 1a. What is the longest article?
    - 1b. What is the shortest article?
    - 1c. What is the 95th percentile of article lengths?
    - 1d. What is the size of the vocabulary and the frequency of each token in the corpus?
2. How many different sources are there in the corpus?
    - 2a. How many different sources are there in the dataset?
    - 2b. What is the distribution of articles per source?

In [None]:
## Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

tqdm.pandas()

## News Corpus

You are provided the following news corpus: `data/news_corpus.csv`. The corpus contains the following columns:

- `index` int: The unique identifier of the document.
- `source` str: The source of the document.
- `title` str: The title of the document.
- `text` str: The content of the article.

The data used in this notebook comes from the [`StoryGraph`](https://archive.org/details/storygraph?tab=about) project, created and maintained by Prof. Alexander Nwala.

```BibTeX
@MISC{nwala-cj20,
    author       = {Alexander Nwala and Michele C. Weigle and Michael L. Nelson},
    title        = {365 Dots in 2019: Quantifying Attention of News Sources},
    year         = {2020},
    month        = may,
    howpublished = {Poster/demo accepted at the Computation + Journalism Symposium (symposium cancelled due to COVID-19)},
    arxiv        = {https://arxiv.org/abs/2003.09989},
    pubdate      = {202005}
}
```

## Load Data into a pandas DataFrame

In [None]:
df = pd.read_csv('data/news-2023-02-01.csv')
df.head(10)

## 0. What is the nature of our data?

Using your coding skills, answer the following questions. Please comment on your code and results.
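
As a starting point, here is a minimal sketch of the kinds of checks these questions call for. It runs on a small, hypothetical toy DataFrame (`toy` and its values are made up for illustration), not on the news corpus, so adapt it to `df` and comment on what you find.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row and one missing title.
toy = pd.DataFrame({
    "source": ["a.com", "b.com", "a.com", "c.com"],
    "title": ["T1", "T2", "T1", None],
    "text": ["first", "second", "first", "fourth"],
})

# Size of the data: number of rows (documents) and columns.
n_rows, n_cols = toy.shape

# Count rows that exactly repeat an earlier row, then drop them.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# Count missing values in each column of the de-duplicated data.
missing_per_column = deduped.isna().sum()

print(n_rows, n_cols, n_duplicates)
print(missing_per_column)
```

Note that `duplicated()` with no arguments only flags rows that match on every column. Whether two articles with the same `text` but different titles count as duplicates is a judgment call you should explain.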

In [None]:
## 0a. What is the size of the corpus?

## YOUR CODE HERE

In [None]:
## 0b. Are there any duplicates in the corpus? If so, remove or drop them.

## YOUR CODE HERE

In [None]:
## 0c. Are there any missing values in the corpus? If so, what data are missing?
## Should the missing values be removed? Explain.

## YOUR CODE HERE

## 1. What is the distribution of `token`s per document?

Use the `spaCy` library to tokenize the text and analyze the distribution of the number of tokens per document. You can use the `Counter` class from the `collections` module to count the number of times each token appears in the corpus.
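
The sketch below illustrates the idea on three hypothetical sentences rather than on the corpus. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) is sufficient for counting tokens; the notebook itself loads `en_core_web_sm`, which would also work but is slower.

```python
from collections import Counter

import numpy as np
import spacy

# Assumption: a tokenizer-only pipeline is enough for token counts.
nlp = spacy.blank("en")

# Hypothetical toy texts standing in for the `text` column.
texts = [
    "The cat sat on the mat.",
    "Dogs bark.",
    "The cat and the dog met in the park today.",
]

# Tokenize every text in one pass and record the tokens per document.
docs = list(nlp.pipe(texts))
tokens_per_doc = [len(doc) for doc in docs]

# Count how often each lowercased token appears across all documents.
token_counts = Counter(token.text.lower() for doc in docs for token in doc)

# Summarize the length distribution with its 95th percentile.
p95 = np.percentile(tokens_per_doc, 95)

print(tokens_per_doc)
print(token_counts.most_common(3))
print(p95)
```

Punctuation marks count as tokens here. If you want word counts only, you could filter on `token.is_punct` before counting; state whichever choice you make.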

In [None]:
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## Load the spaCy model: NLP
NLP = spacy.load('en_core_web_sm')


In [None]:
## Generate the tokens using spaCy

## YOUR CODE HERE

In [None]:
## Count the tokens

## YOUR CODE HERE

In [None]:
## Plot the distribution of the number of tokens per document

## YOUR CODE HERE

In [None]:
## 1a. What is the longest article?

## YOUR CODE HERE

In [None]:
## 1b. What is the shortest article?

## YOUR CODE HERE

In [None]:
## 1c. What is the 95th percentile of the number of tokens per document?
## Hint: use np.percentile

## YOUR CODE HERE

In [None]:
## 1d. What is the size of the vocabulary and the frequency of each token in the corpus?
## Hint: use Counter

import string
from collections import Counter

from spacy.lang.en.stop_words import STOP_WORDS

## Create a list of spaCy stop words plus punctuation characters
stop_words = list(STOP_WORDS) + list(string.punctuation)

## YOUR CODE HERE


## 2. How many different sources are there in the corpus?

Please describe how many different sources exist in the dataset.
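
One possible approach, sketched on a hypothetical toy `source` column (the values are made up), is to count the distinct sources and then the number of articles for each one:

```python
import pandas as pd

# Hypothetical toy data standing in for the `source` column of the corpus.
toy = pd.DataFrame(
    {"source": ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]}
)

# Number of distinct sources.
n_sources = toy["source"].nunique()

# Number of articles per source, sorted from most to least frequent.
articles_per_source = toy["source"].value_counts()

print(n_sources)
print(articles_per_source)
```

The resulting per-source counts are what you would pass to a plotting function, for example a seaborn boxplot or a bar chart, to visualize the distribution.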

In [None]:
## 2a. How many different sources are there in the corpus?

## YOUR CODE HERE

In [None]:
## 2b. Plot the distribution of articles per source.
## Hint: use seaborn boxplot

## YOUR CODE HERE