# Report

## Data Wrangling Steps

1. Data Gathering
2. Data Assessment
3. Data Cleaning
4. Data Visualization

## Project Details

Each step was carried out as follows:

#### 1. Data Gathering

- I first opened the Twitter archive enhanced CSV file (provided by the Udacity team) with the pandas `read_csv` function.
- I then downloaded the image prediction data programmatically with the `requests` library.
- I then gathered the Twitter API data by querying Twitter's API for the JSON data of each tweet ID in the Twitter archive, using Python's `Tweepy` library.
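
The API step above could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function names `fetch_tweet_json` and `read_tweet_counts` are hypothetical, `api` is assumed to be an already authenticated `tweepy.API` object, and the fields kept (`id_str`, `retweet_count`, `favorite_count`) are assumed from the columns mentioned later in this report.

```python
import json


def fetch_tweet_json(api, tweet_ids, out_path):
    """Query the API once per tweet ID and store one JSON object per line.

    `api` is assumed to be an authenticated `tweepy.API` instance; any object
    with a compatible `get_status` method works. Returns the IDs that failed.
    """
    failed_ids = []
    with open(out_path, "w", encoding="utf-8") as out_file:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id)
            except Exception:
                # Deleted or protected tweets raise a Tweepy error whose class
                # name depends on the Tweepy version, hence the broad catch.
                failed_ids.append(tweet_id)
                continue
            # Tweepy keeps the raw API response on the `_json` attribute.
            out_file.write(json.dumps(status._json) + "\n")
    return failed_ids


def read_tweet_counts(path):
    """Read the stored JSON lines and keep only the fields used later."""
    rows = []
    with open(path, encoding="utf-8") as in_file:
        for line in in_file:
            tweet = json.loads(line)
            rows.append({
                "id_str": tweet["id_str"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return rows
```

The list returned by `read_tweet_counts` can be passed straight to `pandas.DataFrame` to obtain a table shaped like `tweet_count`. Writing one JSON object per line means an interrupted run still leaves a readable file.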

#### 2. Data Assessment

- First, I inspected the three tables visually to spot obvious problems. I found 5 quality problems and 1 tidiness problem, and noted all of them for later cleaning.
- I then assessed the tables programmatically using `info()`, `value_counts()`, `describe()`, `duplicated()` and `any()`. I found 4 quality problems and 1 tidiness problem, and noted all of them for later cleaning.
- I then collected all the problems in one cell. They are as follows:

### **Quality**

- `df_twit_arch`

  - The following columns have NaN values:
    - `in_reply_to_status_id`
    - `in_reply_to_user_id`
    - `retweeted_status_id`
    - `retweeted_status_user_id`
    - `retweeted_status_timestamp`
  
  - The column name `floofer` should be `floof`.
  
  - The dog stage value *floofer* should be *floof*.
  
  - Dog stages (`doggo`, `floofer`, `pupper`, `puppo`) should be in one column (such as one named `stages`)
  
  - `timestamp` should be *datetime64* dtype.
  
  - The `name` column has many missing values (*None*) and some values that are not names (such as *a*).
  
  - The following columns have missing values (less than 2354 values):
    - `doggo` (2259)
    - `floofer` (2346)
    - `pupper` (2099)
    - `puppo` (2326)

  - `tweet_id` should be *object* dtype.
  
- `images`

  - The following columns mix uppercase and lowercase values inconsistently:
    - `p1`
    - `p2`
    - `p3`
    
- `tweet_count`

  - `id_str` is not a clear name, and it differs from the `tweet_id` column name used in the other tables.

### **Tidiness**

- `tweet_count` should be merged with `df_twit_arch`, because the data in the two tables is related.

- `df_twit_arch`

  - `source` packs too much information into one value (HTML tags around both a URL and a source name), which makes it cluttered.
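
The programmatic assessment described above could look roughly like this. It is a minimal sketch on invented toy rows, not the project's actual notebook code; the helper name `assess` is hypothetical, and `isnull()` is used alongside the methods named in the report.

```python
import pandas as pd


def assess(df, id_col="tweet_id"):
    """Collect a few programmatic checks for one table into a dict."""
    return {
        # Columns that contain at least one NaN value.
        "columns_with_nan": df.columns[df.isnull().any()].tolist(),
        # True if any ID appears more than once.
        "has_duplicate_ids": bool(df[id_col].duplicated().any()),
        # Summary statistics for the numeric columns.
        "summary": df.describe(),
    }


# Toy stand-in for df_twit_arch; the rows are invented for illustration.
df_twit_arch = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "name": ["Charlie", "a", "None", "None"],
    "in_reply_to_status_id": [float("nan"), 10.0, float("nan"), float("nan")],
})

df_twit_arch.info()  # prints dtypes and non-null counts per column
report = assess(df_twit_arch)
# Frequency table that exposes odd values such as "a" and "None".
name_counts = df_twit_arch["name"].value_counts()
```

Checks like these only flag candidates; deciding whether a flagged value is really a quality problem still takes a look at the rows themselves.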

#### 3. Data Cleaning

- I first created copies of the tables with `copy()`, so that all cleaning was done on the copies and the original tables stayed intact.
- I then proceeded to clean the datasets in the following way:
  - Removed *retweets*, *replies* to original tweets and *replies* to replies. 
  - Changed the column name `floofer` to `floof`.
  - Replaced all instances of `floofer` with `floof`.
  - Replaced the *None* values in the dog *stages* columns (`doggo`, `floof`, `pupper`, `puppo`) with empty strings.
  - Put all dog stages (`doggo`, `floof`, `pupper`, `puppo`) inside a new `stages` column.
  - Changed the `timestamp` dtype to `datetime64`.
  - Converted all missing names and all values that are not names to *NaN*.
  - Changed the `tweet_id` dtype to `object`.
  - Converted all the values in `p1`, `p2` and `p3` to lowercase.
  - Changed the column name `id_str` to `tweet_id`.
  - Removed *HTML tags* from the `source` column.
  - Saved the *sources* and *urls* from `source` in new columns.
  - Merged `tweet_count` and `df_twit_arch`.
  - Merged the resulting dataframe with `images` into a *master dataset*.
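
The cleaning steps listed above could be sketched as follows. This is a minimal sketch under stated assumptions rather than the project's actual code: the function names, the `source_url` column name, the comma used to join multiple stages, the rule that all-lowercase words are not names, the HTML pattern for `source`, and the inner merges are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

STAGE_COLS = ["doggo", "floof", "pupper", "puppo"]
# Assumed shape of the HTML anchor stored in `source`.
SOURCE_PATTERN = r'<a href="([^"]+)"[^>]*>([^<]+)</a>'


def clean_archive(df_twit_arch):
    """Return a cleaned copy of the archive table."""
    df = df_twit_arch.copy()

    # Keep original tweets only: drop retweets and replies.
    is_original = (
        df["retweeted_status_id"].isnull() & df["in_reply_to_status_id"].isnull()
    )
    df = df[is_original].copy()

    # Rename the column, then fix the values and blank out "None".
    df = df.rename(columns={"floofer": "floof"})
    df[STAGE_COLS] = df[STAGE_COLS].replace({"None": "", "floofer": "floof"})

    # Combine the four stage columns; joining several stages with a comma
    # is an assumption of this sketch.
    df["stages"] = df[STAGE_COLS].apply(
        lambda row: ",".join(stage for stage in row if stage), axis=1
    )

    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["tweet_id"] = df["tweet_id"].astype(str)

    # Assumed rule: "None" and all-lowercase words (such as "a") are not names.
    not_a_name = df["name"].eq("None") | df["name"].str.islower()
    df["name"] = df["name"].mask(not_a_name)

    # Split the HTML anchor into its URL and its visible text.
    extracted = df["source"].str.extract(SOURCE_PATTERN)
    df["source_url"] = extracted[0]
    df["source"] = extracted[1]
    return df


def build_master(df_twit_arch, tweet_count, images):
    """Merge the three tables on tweet_id (inner joins are an assumption)."""
    tweet_count = tweet_count.rename(columns={"id_str": "tweet_id"})
    tweet_count["tweet_id"] = tweet_count["tweet_id"].astype(str)

    images = images.copy()
    images["tweet_id"] = images["tweet_id"].astype(str)
    for col in ["p1", "p2", "p3"]:
        images[col] = images[col].str.lower()

    merged = df_twit_arch.merge(tweet_count, on="tweet_id", how="inner")
    return merged.merge(images, on="tweet_id", how="inner")
```

Converting `tweet_id` to strings in every table before merging avoids key mismatches between integer and string IDs.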


#### 4. Data Visualization

- I asked the following questions and got the following insights:
  - Are retweets and favourites related?
    - They are positively correlated. Most values are between 0 and 10k retweets, and between 0 and 40k favourites. There are also some very large outliers.
  - What are the ten most frequent dog breeds?
    - From prediction 1:
      - Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, Pug, Chow, Samoyed, Pomeranian, Toy Poodle, Malamute.
    - From prediction 2:
      - Labrador Retriever, Golden Retriever, Cardigan, Chihuahua, Chesapeake Bay Retriever, French Bulldog, Pomeranian, Toy Poodle, Siberian Husky, Cocker Spaniel. 
    - From prediction 3:
      - Labrador Retriever, Chihuahua, Golden Retriever, Eskimo Dog, Kelpie, Kuvasz, Chow, Staffordshire Bull Terrier, Beagle, Toy Poodle.
  - What are the ten most common dog names? 
    - Charlie, Lucy, Cooper, Oliver, Penny, Tucker, Sadie, Winston, Daisy and Lola.
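
The questions above could be answered with pandas along these lines. This is a minimal sketch on invented toy rows, not the project's actual analysis; the helper names are hypothetical, and the column names `retweet_count` and `favorite_count` are assumed for the master dataset.

```python
import pandas as pd


def retweet_favorite_correlation(df):
    """Pearson correlation between the retweet and favourite counts."""
    return df["retweet_count"].corr(df["favorite_count"])


def most_frequent(series, n=10):
    """The n most frequent values; value_counts ignores NaN by default."""
    return series.value_counts().head(n)


# Toy stand-in for the master dataset; the rows are invented for illustration.
master = pd.DataFrame({
    "retweet_count": [1, 2, 3, 10],
    "favorite_count": [5, 9, 14, 40],
    "name": ["Charlie", "Charlie", "Lucy", None],
    "p1": ["golden_retriever", "golden_retriever", "pug", "golden_retriever"],
})

correlation = retweet_favorite_correlation(master)
top_names = most_frequent(master["name"])
top_breeds_p1 = most_frequent(master["p1"])
```

A correlation coefficient summarizes the relationship in one number, but a scatter plot is still needed to see the clustering and the outliers described above.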


#### Insights:

- Favourite count and retweet count are positively correlated.
- The most frequent dog breed in p1 is Golden Retriever.
- The most frequent dog breed in p2 is Labrador Retriever.
- The most frequent dog breed in p3 is Labrador Retriever.
- The most common dog name is Charlie (male), followed by Lucy (female). 