## Reporting: wrangle_report
#### By: [Samuel Duah Boadi](https://www.linkedin.com/in/samuel-duah-boadi-8ab46944/)

## Introduction

Data wrangling is the process of gathering data, assessing its quality and structure, and then cleaning it. This project covers all three steps: gathering, assessing and cleaning.

This report describes the wrangling work done at each step.

## Gathering Data

In the first step, the three datasets were gathered using three different methods.

> **Download manually**<br/>
The file twitter_archive_enhanced.csv was downloaded manually through the link provided. Once saved in the appropriate folder, it was read into a pandas dataframe named twitter_arch.

> **Download programmatically**<br/>
The file image_predictions.tsv, hosted on Udacity’s servers, was downloaded programmatically using the requests library and read into a pandas dataframe named image_pred.
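
The download-and-read step can be sketched as below, assuming the `requests` and `pandas` packages are installed. The helper names `download_file` and `load_predictions` are hypothetical, and the URL is left as a parameter because this report does not state it.

```python
from pathlib import Path

import pandas as pd
import requests


def download_file(url, dest):
    """Download `url` and write the response body to `dest`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors (4xx/5xx)
    dest = Path(dest)
    dest.write_bytes(response.content)
    return dest


def load_predictions(path):
    """Read the tab-separated predictions file into a DataFrame."""
    return pd.read_csv(path, sep="\t")
```

With the provided link stored in a variable `url`, the dataframe could then be created with `image_pred = load_predictions(download_file(url, "image_predictions.tsv"))`.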

> **API**<br/>
Additional data containing the retweet count and favorite count of each tweet in the Twitter archive was needed. To get that data, I used the tweet IDs in the archive to query the Twitter API using the tweepy library.
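
A rough sketch of how the counts could be collected is shown below. It is not necessarily the code used in the project: the function `collect_counts` and its `fetch_status` argument are hypothetical, and the actual API call is passed in by the caller so that no credentials appear here.

```python
import pandas as pd


def collect_counts(tweet_ids, fetch_status):
    """Build a DataFrame of retweet and favorite counts.

    `fetch_status` is any callable that maps a tweet ID to the tweet's
    JSON dictionary. With tweepy this could be, for example:
        lambda i: api.get_status(i, tweet_mode="extended")._json
    where `api` is an authenticated tweepy.API object (not created here).
    """
    rows, failed_ids = [], []
    for tweet_id in tweet_ids:
        try:
            status = fetch_status(tweet_id)
        except Exception:  # e.g. a deleted tweet; narrow this in real code
            failed_ids.append(tweet_id)
            continue
        rows.append(
            {
                "tweet_id": status["id"],
                "retweet_count": status["retweet_count"],
                "favorite_count": status["favorite_count"],
            }
        )
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(rows, columns=columns), failed_ids
```

Returning the failed IDs alongside the dataframe makes it easy to see which tweets in the archive could not be retrieved.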

## Assessing Data

The second step was assessing the quality and structure of the three dataframes. I scrolled through the datasets visually, which in some cases was not effective. Assessing them programmatically was more helpful for finding issues.

The pandas tools used to assess the datasets were the shape attribute, to see the number of observations and columns, and the methods info(), value_counts(), duplicated() and describe().
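
As an illustration, the calls listed above can be applied to a small toy dataframe. The values below are made up for the example and are not taken from the project data.

```python
import pandas as pd

# Toy stand-in for one of the tables; the values are made up.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 2, 3],
        "name": ["Bella", "None", "None", "a"],
        "rating_denominator": [10, 10, 10, 0],
    }
)

n_rows, n_cols = df.shape                # (observations, columns)
df.info()                                # dtypes and non-null counts
name_counts = df["name"].value_counts()  # frequency of each value
n_duplicates = df.duplicated().sum()     # number of fully duplicated rows
summary = df.describe()                  # summary statistics of numeric columns
```

Here `summary.loc["min", "rating_denominator"]` exposes the zero denominator, the kind of issue that is easy to miss when scrolling through the data visually.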

The issues identified were documented at the bottom of the Assessing Data section. They were grouped into two categories: quality issues and tidiness issues.

#### Quality issues
> `twitter_arch` table<ul>
    <li>tweet_id is an integer not a string</li>       
    <li>in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id are floats not strings</li>
    <li>timestamp and retweeted_status_timestamp are strings not datetimes</li>
    <li>doggo, floofer, pupper and puppo are strings not booleans</li>
    <li>name column has wrong dog names</li>
    <li>Some columns in the dataset are not needed for the analysis</li>
    <li>A row has rating_denominator equal to 0</li>
</ul>

> `image_pred` table<ul>
    <li>tweet_id is an integer not a string</li>
    <li>Underscore (_) between words in p1, p2 and p3 columns</li>
    <li>Inconsistent case in p1, p2 and p3 columns</li>
</ul>

> `tweet_json` table<ul>
    <li>tweet_id is an integer not a string</li>
</ul>
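
A few of the quality issues above could be addressed along the following lines. This is a minimal sketch on made-up rows, not the project's actual cleaning code, and the choice of lower case for the prediction columns is an assumption.

```python
import pandas as pd

# Made-up rows that mimic some of the issues listed above.
twitter_arch = pd.DataFrame(
    {
        "tweet_id": [101, 102],
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    }
)
image_pred = pd.DataFrame(
    {"tweet_id": [101, 102], "p1": ["golden_retriever", "Chihuahua"]}
)

# IDs are labels, not quantities, so store them as strings.
twitter_arch["tweet_id"] = twitter_arch["tweet_id"].astype(str)
image_pred["tweet_id"] = image_pred["tweet_id"].astype(str)

# Parse the timestamp strings into datetimes.
twitter_arch["timestamp"] = pd.to_datetime(twitter_arch["timestamp"])

# Replace underscores with spaces and use one consistent case.
image_pred["p1"] = image_pred["p1"].str.replace("_", " ", regex=False).str.lower()
```

The same string operations would apply to the p2 and p3 columns.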


#### Tidiness issues
<ul>
    <li><code>tweet_json</code> table should be part of the <code>twitter_arch</code> table</li>
    <li><code>image_pred</code> table should be part of the <code>twitter_arch</code> table</li>
    <li>Three columns (doggo, pupper and puppo) are used instead of one 'dog_stage' column</li>
</ul>
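
The first two tidiness issues amount to joining the three tables on tweet_id. A minimal sketch with made-up rows is shown below; the inner join is an assumption, since the report does not state which join type was used.

```python
import pandas as pd

# Made-up rows; tweet_id is already stored as a string in all three tables.
twitter_arch = pd.DataFrame(
    {"tweet_id": ["1", "2", "3"], "rating_numerator": [13, 12, 11]}
)
tweet_json = pd.DataFrame(
    {"tweet_id": ["1", "2"], "retweet_count": [50, 7], "favorite_count": [200, 30]}
)
image_pred = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["pug", "beagle"]})

# An inner join keeps only the tweets that appear in all three tables.
merged = twitter_arch.merge(tweet_json, on="tweet_id", how="inner").merge(
    image_pred, on="tweet_id", how="inner"
)
```

Using `how="left"` instead would keep every row of the archive and fill the missing counts or predictions with NaN.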

## Cleaning Data

The final step in data wrangling is cleaning. In this step, all of the issues documented while assessing were cleaned. Before cleaning, a copy of the original data was made so that the original data was left unchanged.

The programmatic data cleaning process was followed, that is, define, code and test.<br/>
For each issue, how to clean it was defined, the definition was converted into executable code, and the data was then tested to ensure the code had worked correctly.
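
The define, code and test cycle can be illustrated with one of the quality issues, the dog stage columns stored as strings. The sketch below uses made-up rows and assumes that a missing stage is recorded as the placeholder string 'None'.

```python
import pandas as pd

original = pd.DataFrame(
    {
        "tweet_id": ["1", "2"],
        "doggo": ["doggo", "None"],
        "pupper": ["None", "pupper"],
    }
)
clean = original.copy()  # work on a copy so the original stays untouched

# Define: convert the stage columns from strings to booleans.

# Code: a cell is True when it holds the stage name, False for the placeholder.
for stage in ["doggo", "pupper"]:
    clean[stage] = clean[stage] == stage

# Test: the columns are boolean and the original is unchanged.
assert clean["doggo"].tolist() == [True, False]
assert clean["pupper"].dtype == bool
assert original["doggo"].tolist() == ["doggo", "None"]
```

Keeping the test as plain assertions means a failed cleaning step stops the notebook instead of passing silently.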

The cleaned dataset was saved to a CSV file named twitter_archive_master.csv.
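
Saving and reloading the file could look like the sketch below. The helper names `save_master` and `load_master` are hypothetical. Passing `dtype` when reading is a suggestion rather than something stated in this report: a CSV file does not store column types, so tweet_id would otherwise be read back as an integer.

```python
import pandas as pd


def save_master(df, path="twitter_archive_master.csv"):
    """Write the cleaned dataframe to CSV without the index column."""
    df.to_csv(path, index=False)


def load_master(path="twitter_archive_master.csv"):
    """Read the master file back, keeping tweet_id as a string."""
    return pd.read_csv(path, dtype={"tweet_id": str})
```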
 

### Limitations
Based on the definition, dogs were classified as either doggo, pupper or puppo. Floofer is any dog with 'seemingly excess fur'. Some observations had multiple dog stages (doggo, pupper, puppo and floofer), so it was impossible to create a 'dog_stage' column stating only one stage per dog.
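
One way to surface the affected rows is to count how many stage columns are set for each tweet. The sketch below uses made-up rows and assumes the stage columns have already been converted to booleans. The combined label it builds is an alternative for illustration, not the approach taken in this project.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]
df = pd.DataFrame(
    {
        "doggo": [True, True, False],
        "floofer": [False, False, False],
        "pupper": [False, True, False],
        "puppo": [False, False, False],
    }
)

# Count the stages per row and keep the rows with more than one.
n_stages = df[stage_cols].sum(axis=1)
multi_stage = df[n_stages > 1]


def join_stages(row):
    """Join the names of all stages set in a row; empty string if none."""
    return ", ".join(col for col in stage_cols if row[col])


df["dog_stage"] = df[stage_cols].apply(join_stages, axis=1)
```

In this toy data the second row carries both doggo and pupper, so no single stage can be assigned to it without discarding information.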

A number of tweets did not include the name of the dog, which prompted the use of 'None' in those cases.

The image classifier predicted that some images did not contain a dog when in some cases there were actually dogs in the images.