#  <font size = "12" color = lightblue> <div align="center"> Wrangle Report </div> </font> 

# Introduction

Wrangling and analyzing data is one of the projects I get to do as a student in the Udacity Data Analyst Nanodegree program.

Specifically, my tasks in the project are as follows:

* Data wrangling, which consists of:
 * Gathering data 
 * Assessing data
 * Cleaning data
* Storing, analyzing, and visualizing the wrangled data
* Reporting on 
 * my data wrangling efforts 
 * my data analyses and visualizations

This report therefore documents my data wrangling effort.

###  What data do we have?

The data we wrangle and analyze in this project comes from the Twitter archive of the user [@dog_rates](https://twitter.com/dog_rates), also known as WeRateDogs. Since it is real-world data, it is dirty and needs cleaning before it can be analyzed.

### What software?

I use Python to wrangle the WeRateDogs Twitter data, with the following libraries to help build up our dataframe:
* pandas
* NumPy
* requests
* tweepy
* json

I’m using a Jupyter Notebook to write the code, and you can get the finished product from my GitHub repo.

**Let's take a look at how these packages are imported**

In [None]:
#Import all packages needed
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import requests
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
import re

#   Data wrangling steps

## 1) Gathering data:

There are three pieces of data to be gathered.

* **The WeRateDogs Twitter archive** (twitter_archive_enhanced.csv): this file was downloaded manually from the Udacity resource page.
* **The tweet image predictions** (image_predictions.tsv): this table tells us what breed of dog is present in each tweet according to a neural network. The file is hosted on Udacity's servers and was downloaded programmatically using the Requests library and the following URL: [image_predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)
* **Additional data via the Twitter API**: I query Twitter's API to gather additional data for each tweet, such as the retweet count and the favorite count.
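
The programmatic parts of this step could look roughly like the sketch below. It is not the notebook's actual code: the helper names `download_file` and `parse_tweet_lines` are hypothetical, and it assumes the API results were saved as one JSON object per line, using the `id`, `retweet_count`, and `favorite_count` fields of Twitter's tweet object.

```python
import json


def download_file(url, path):
    """Download `url` with Requests and write the raw bytes to `path`."""
    # Imported here so the parsing helper below also works without Requests installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    with open(path, "wb") as f:
        f.write(response.content)


def parse_tweet_lines(lines):
    """Turn JSON-encoded tweets (one per line) into a list of small dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Made-up sample lines standing in for the saved API responses.
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 8}',
]
records = parse_tweet_lines(sample)
print(records[0])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` to build the third table.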




## 2) Assessing Data

Here we identify issues with our data. This identification can be done 
* **Visually**, by opening the data in a spreadsheet. I first opened the data in an Excel spreadsheet.
* **Programmatically**, by writing code that calls pandas functions such as
 * df.head()/df.tail(): returns the first/last 5 rows of the dataset
 * df.describe(): returns summary statistics of the quantitative (numeric) variables
 * df.info(): prints information about data types, the number of rows and columns, non-null counts, etc.
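
As an illustration of the programmatic checks, here is a minimal sketch on a tiny made-up table. The column names and values are assumptions chosen for the example, not rows from the real datasets.

```python
import pandas as pd

# Tiny made-up table standing in for one of the three datasets.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 2],
    "rating_numerator": [12, 13, 1776, 13],
    "name": ["Bella", "a", None, "a"],
})

print(df.head())      # first five rows (all four rows here)
df.info()             # dtypes, row count, non-null count per column
print(df.describe())  # summary statistics of the numeric columns

# Two counts that point directly at quality issues.
n_duplicates = int(df.duplicated().sum())         # fully duplicated rows
n_missing_names = int(df["name"].isnull().sum())  # missing dog names
print(n_duplicates, n_missing_names)
```

Checks like `duplicated()` and `isnull()` complement the visual pass, because they scale to tables that are too long to scan by eye.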

#### List of issues in dataset 

### Quality issues:

* Duplicates in the image_prediction table
* Outliers in numerator rating
* Missing values in dog stage column.
* Inaccurate entry of dog names
* Missing images in some tweets
* Only original tweets are of interest, so the 181 retweets need to be removed
* Incorrect datatypes of some variables.
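
Two of these quality issues, the retweets and the incorrect datatypes, could be handled along the following lines. This is a sketch on made-up rows: the column names `retweeted_status_id` and `timestamp` are assumptions about the archive's layout, not something stated in this report.

```python
import pandas as pd

# Made-up rows; the column names are assumed for illustration.
archive = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000"],
    "retweeted_status_id": [None, 9.9e17, None],
})

# Keep original tweets only: a retweet has a non-null retweeted_status_id.
originals = archive[archive["retweeted_status_id"].isnull()].copy()

# Fix datatypes: IDs are labels, not quantities; timestamps should be datetimes.
originals["tweet_id"] = originals["tweet_id"].astype(str)
originals["timestamp"] = pd.to_datetime(originals["timestamp"])

print(originals.dtypes)
```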

### Tidiness issues:

* The three pieces of data need to be merged into a single dataframe
* The different dog stages are each represented in a separate column. We need to melt all stages into a single column called **Dog_stage**
* Columns that are not useful for analysis need to be deleted
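
The melt and merge steps could be sketched as follows. The stage column names (`doggo`, `floofer`, `pupper`, `puppo`), the use of the string `"None"` for a missing stage, and the other column names are assumptions made for this example; the tables are made up.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]  # assumed column names

# Made-up stand-ins for the three tables.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["golden_retriever", "pug"]})
counts = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [10, 3, 7]})

# Melt the four stage columns into long form and drop the "None" placeholders.
long = archive.melt(id_vars="tweet_id", value_vars=stage_cols,
                    value_name="Dog_stage")
long = long[long["Dog_stage"] != "None"]

# One row per tweet; a tweet with several stages gets them joined by ", ".
stages = long.groupby("tweet_id")["Dog_stage"].agg(", ".join).reset_index()

# Replace the four wide columns with the single Dog_stage column.
tidy = archive.drop(columns=stage_cols).merge(stages, on="tweet_id", how="left")

# Merge the three pieces of data into a single dataframe.
master = (tidy.merge(predictions, on="tweet_id", how="inner")
              .merge(counts, on="tweet_id", how="inner"))
print(master)
```

The inner joins keep only tweets present in all three tables, which also drops tweets without an image prediction; a left join would be the choice if those tweets should be kept.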

## 3) Cleaning data

We have assessed our data and listed certain issues in the data that would require cleaning. To actually perform cleaning, the following steps are considered:

* **Define**: state the issue to be cleaned and how we will go about cleaning it.
* **Code**: write the code that performs the cleaning operation.
* **Test**: test the code to confirm that the defined cleaning operation was performed correctly.
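
One pass through the Define, Code, Test cycle might look like the sketch below, using the duplicate image predictions as the example issue. The column name `jpg_url` and the rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; `jpg_url` is an assumed column name.
predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
})

# Define: some rows share the same jpg_url; keep only the first row per image.

# Code
predictions_clean = predictions.drop_duplicates(subset="jpg_url", keep="first")

# Test: no image URL may appear more than once after cleaning.
assert not predictions_clean["jpg_url"].duplicated().any()
print(len(predictions), "->", len(predictions_clean))
```

Writing the test as an assertion means a failed cleaning step stops the notebook instead of passing silently.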

# Conclusion

At the end of the data wrangling process, the following documents were put together to convey our observations and procedures:

* **wrangle_act.ipynb**: complete documentation of the code, data wrangling process, analysis, and visualizations
* **act_report.html**: documentation of insights and visuals from the final (cleaned) data
* **wrangle_report.html**: documentation of the data wrangling steps


In [2]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'wrangle_report.ipynb'])


0