# Data Wrangling Project

## Internal Report

--------------------------

## Abstract

In this report I present the main steps I took to wrangle the **WeRateDogs** dataset. It is an overview of the main cleaning and analysis issues I addressed. For technical details, please refer to the `wrangle_act` Jupyter notebook.

-----------------

## Steps taken

The work described in this report consists of four main steps:

1. Gathering data.

2. Assessing the dataset.

3. Cleaning the dataset.

4. Providing some analysis and insights.

### Gathering data

In the Gathering step, I performed the following:

- Downloaded the 'twitter-archive-enhanced.csv' file using the link provided in the project page.

- Downloaded the 'image_predictions.tsv' file programmatically using the **Requests** library in Python.

- Used the Tweepy API to get supplementary tweet information such as favorite count and retweet count.

After this, I used pandas to import the CSV and TSV files and to parse the JSON data, creating three dataframes: `tweet_details_df`, `tweet_arch_df` and `image_pred_df`.
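The loading step described above can be sketched roughly as follows. This is an illustration, not the notebook's code: the rows are made up, the files are replaced by in-memory strings so the example runs on its own, and the assumption that the tweet JSON is stored one object per line is mine.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the three gathered files (made-up rows).
archive_csv = "tweet_id,text\n1,Good dog\n2,Great dog\n"
predictions_tsv = "tweet_id\tp1\n1\tgolden_retriever\n2\tpug\n"
# Assumption: one JSON object per line, as is common when saving API results.
tweets_json_lines = (
    '{"id": 1, "favorite_count": 10, "retweet_count": 3}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7}\n'
)

# CSV: comma-separated is the default for read_csv.
tweet_arch_df = pd.read_csv(io.StringIO(archive_csv))

# TSV: same reader, with a tab separator.
image_pred_df = pd.read_csv(io.StringIO(predictions_tsv), sep="\t")

# JSON: parse each non-empty line into a dict, then build the dataframe.
tweet_details_df = pd.DataFrame(
    [json.loads(line) for line in tweets_json_lines.splitlines() if line.strip()]
)
```

With real files, the `io.StringIO(...)` arguments would simply be replaced by file paths.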

### Assessing data

Here are the issues I identified during assessment and selected for cleaning:


#### Quality issues

1- **`tweet_details`** table: 

- The `favorite_count` column should have an integer datatype rather than an object datatype.
- The `retweet_count` column should have an integer datatype rather than an object datatype.
- The `created_at` column should be of date/time datatype rather than string.
- The `quoted_status_id` column should be of string datatype rather than floating point.
- The `possibly_sensitive` and `possibly_sensitive_appealable` columns have missing data (2220 values instead of 2356); their datatypes should also be converted to categorical.
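
A minimal sketch of how the datatype fixes listed for `tweet_details` could look in pandas. The frame below is made up, and the exact `created_at` format string is an assumption based on the usual Twitter API timestamp layout; the notebook may handle these columns differently.

```python
import pandas as pd

# Made-up sample with the "wrong" datatypes described in the list above.
tweet_details_clean = pd.DataFrame({
    "favorite_count": ["10", "25"],
    "retweet_count": ["3", "7"],
    "created_at": [
        "Tue Aug 01 16:23:56 +0000 2017",
        "Tue Aug 01 00:17:27 +0000 2017",
    ],
    "quoted_status_id": [12345.0, None],
    "possibly_sensitive": [False, None],
})

# Object (string) counts -> integers.
for col in ["favorite_count", "retweet_count"]:
    tweet_details_clean[col] = tweet_details_clean[col].astype(int)

# String timestamps -> timezone-aware datetimes (format is an assumption).
tweet_details_clean["created_at"] = pd.to_datetime(
    tweet_details_clean["created_at"],
    format="%a %b %d %H:%M:%S %z %Y",
    utc=True,
)

# Float IDs -> strings, leaving missing values as missing.
tweet_details_clean["quoted_status_id"] = tweet_details_clean["quoted_status_id"].map(
    lambda v: str(int(v)) if pd.notna(v) else None
)

# Flag column -> categorical; missing values stay missing.
tweet_details_clean["possibly_sensitive"] = (
    tweet_details_clean["possibly_sensitive"].astype("category")
)
```

One caveat: real tweet IDs have around 18 digits, which a floating-point column cannot represent exactly, so it is safer to read ID columns as strings at load time than to convert them afterwards.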

2- **`tweet_arch`** table:

- The `tweet_id` column should be of a string datatype rather than integer.
- The `timestamp` column should be of datatype datetime rather than a string.


3- **`image_pred`** table: 

- The `p1` column has underscores between individual words; these should be replaced by spaces.
- The `p2` column has underscores between individual words; these should be replaced by spaces.
- The `p3` column has underscores between individual words; these should be replaced by spaces.
- The `tweet_id` column should be of a string datatype rather than integer.
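
The `image_pred` fixes above could be implemented along these lines. The sample rows are invented for illustration; only the column names come from the list.

```python
import pandas as pd

# Made-up sample of the image prediction table.
image_pred_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "pug"],
    "p2": ["Labrador_retriever", "bull_mastiff"],
    "p3": ["cocker_spaniel", "French_bulldog"],
})

# Replace underscores with spaces in each prediction column.
for col in ["p1", "p2", "p3"]:
    image_pred_clean[col] = image_pred_clean[col].str.replace("_", " ", regex=False)

# Treat tweet IDs as identifiers (strings), not numbers.
image_pred_clean["tweet_id"] = image_pred_clean["tweet_id"].astype(str)
```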

#### Tidiness issues

- The columns `in_reply_to_user_id`, `source`, `in_reply_to_status_id` are redundant in the `tweet_arch` and the `tweet_details` tables.

- The columns `doggy`, `floofer`, `puppy` and `puppo` in the `tweet_arch` table should all be melted into a single `dog stage` column.

- The column `source` in the `tweet_details` table has an untidy format. Hyperlinks are wrapped in HTML `<a></a>` tags. Instead, the `source` column should contain the hyperlink strings directly.
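
The last two tidiness fixes can be sketched as below. Several things here are assumptions for illustration: the stage column names are taken from the list above, absent stages are assumed to be stored as the string `"None"`, the output column name `dog_stage` is hypothetical, and "hyperlink string" is read as the `href` URL of the anchor tag.

```python
import pandas as pd

# Made-up sample; absent stages are assumed to be the string "None".
tweet_arch_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggy": ["doggy", "None", "None"],
    "floofer": ["None", "None", "None"],
    "puppy": ["None", "puppy", "None"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggy", "floofer", "puppy", "puppo"]

# Melt the four stage columns into one long column, one row per (tweet, stage).
dog_stages = tweet_arch_clean.melt(
    id_vars="tweet_id", value_vars=stage_cols, value_name="dog_stage"
)
# Keep only rows where a stage was actually recorded.
dog_stages = dog_stages[dog_stages["dog_stage"] != "None"]
dog_stages = dog_stages[["tweet_id", "dog_stage"]].reset_index(drop=True)

# Pull the URL out of the HTML anchor stored in `source`.
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
])
source_url = source.str.extract(r'href="([^"]+)"', expand=False)
```

Tweets with no recorded stage drop out of `dog_stages` entirely, which is one reason to keep it as a separate table keyed by `tweet_id`.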

I also had to perform a second assessment iteration, which raised the following issues:

#### Quality issues


- The `tweet_id` column in the `dog_stages` dataframe should be of a string datatype not an integer.

#### Tidiness issues

- The `tweet_details_clean` and the `tweet_arch_clean` dataframes can be merged into a single dataframe.

All of these issues were resolved in the **Cleaning step**.
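
As an illustration of the merge from the second iteration, here is a small sketch. It assumes both frames share a string `tweet_id` key (which is why the ID datatype fixes matter) and uses a left join; the name `tweets_clean` and the sample values are hypothetical.

```python
import pandas as pd

# Made-up samples; both frames use string tweet IDs as the key.
tweet_arch_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_numerator": [12, 13, 11],
})
tweet_details_clean = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "favorite_count": [10, 25],
})

# Left join: keep every archived tweet, even if no API details were found.
tweets_clean = tweet_arch_clean.merge(tweet_details_clean, on="tweet_id", how="left")
```

If the key columns had different datatypes (for example integer in one frame and string in the other), pandas would refuse the merge or match nothing, so the ID conversions need to happen first.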

### Cleaning data

In this step I handled each issue identified in the assessment step. I started by resolving missing data issues, then tidiness issues and finally the remaining quality issues. This order was recommended by the lessons. Each cleaning operation consists of a **Define** step, a **Code** step and a **Test** step.
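
To show what one Define/Code/Test cycle might look like, here is a sketch for the two `tweet_arch` quality issues. The sample rows and the timestamp format are made up; the notebook's own cells may be organized differently.

```python
import pandas as pd

# Made-up sample of the raw archive table.
tweet_arch_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

# Define: convert `tweet_id` to string and `timestamp` to datetime.

# Code: work on a copy so the raw dataframe stays untouched.
tweet_arch_clean = tweet_arch_df.copy()
tweet_arch_clean["tweet_id"] = tweet_arch_clean["tweet_id"].astype(str)
tweet_arch_clean["timestamp"] = pd.to_datetime(tweet_arch_clean["timestamp"], utc=True)

# Test: verify the datatypes programmatically.
assert tweet_arch_clean["tweet_id"].map(type).eq(str).all()
assert pd.api.types.is_datetime64_any_dtype(tweet_arch_clean["timestamp"])
```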

### Analysis and Insights

In this section, I posed three questions and answered them using visualizations and some statistical methods. Here are the questions:

- Q1: what is the distribution of rating numerators for those with a denominator of 10?

- Q2: what are the highest rated dog stages?

- Q3: what are the five dog types that are most frequently recognized in the image prediction task?

To check out my answers to these questions, please refer to the `wrangle_act` Jupyter notebook or the `act_report` report.
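
As a rough illustration of how the three questions could be approached with pandas (not the notebook's actual code), consider the sketch below. The data is invented, the column names `dog_stage` and `p1` in a single merged frame are assumptions, and using only the top prediction `p1` for Q3 is my simplification.

```python
import pandas as pd

# Made-up merged sample.
tweets = pd.DataFrame({
    "rating_numerator": [12, 13, 12, 10, 84],
    "rating_denominator": [10, 10, 10, 10, 70],
    "dog_stage": ["doggy", "puppo", "doggy", "puppy", None],
    "p1": ["golden retriever", "pug", "golden retriever", "chow", "pug"],
})

# Restrict Q1 and Q2 to ratings out of 10 so numerators are comparable.
out_of_ten = tweets[tweets["rating_denominator"] == 10]

# Q1: how often each numerator occurs, ordered by numerator.
numerator_counts = out_of_ten["rating_numerator"].value_counts().sort_index()

# Q2: mean numerator per dog stage, highest first.
mean_rating_by_stage = (
    out_of_ten.groupby("dog_stage")["rating_numerator"]
    .mean()
    .sort_values(ascending=False)
)

# Q3: the five most frequent top predictions.
top_breeds = tweets["p1"].value_counts().head(5)
```

Calling `numerator_counts.plot(kind="bar")` would give a quick view of the Q1 distribution.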