# Wrangle Report

This report briefly describes my wrangling efforts while doing the project (saved as 'wrangle_act.ipynb').

Data Wrangling consists of:
- data gathering
- data assessing
- data cleaning

## Data Gathering

I gathered data from 3 different sources:
- 'twitter-archive-enhanced.csv', which contains the WeRateDogs Twitter archive, was manually downloaded from the Udacity server and read into a dataframe.
- The image predictions file was downloaded programmatically using the Requests library from the URL 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv' and read into a second dataframe.
- Additional data (favorite and retweet counts) was downloaded by querying the Twitter API using Python's Tweepy library. The query results were saved as a JSON file, with each tweet stored on its own line, and then read into a new dataframe with 3 columns (id, favorite_count, retweet_count).
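
The last step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper name `tweets_to_dataframe` and the sample records are hypothetical, and it assumes each line of the file holds one tweet's JSON with `id`, `favorite_count`, and `retweet_count` keys.

```python
import io
import json

import pandas as pd


def tweets_to_dataframe(lines):
    """Build a dataframe from an iterable of JSON strings, one tweet per line.

    Hypothetical helper: only the three fields named in the report are kept.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        rows.append({
            "id": tweet["id"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return pd.DataFrame(rows, columns=["id", "favorite_count", "retweet_count"])


# Made-up records standing in for the file written from the Twitter API query.
sample_file = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7, "lang": "en"}\n'
)
df3 = tweets_to_dataframe(sample_file)
print(df3)
```

With a real file, the same helper could be called on an open file handle instead of the in-memory sample.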

## Data Assessing

I assessed each dataframe first manually, then programmatically, using different methods and functions:
- **.info()** method to assess null values and datatypes of columns
- **selecting data by index** and **by column names** to access part of the data
- **df[Series.isnull()]** and **df[Series.notnull()]** to select rows that contain or don't contain null values
- **.shape** to see number of rows and columns in the dataframe
- **.duplicated()** to check for duplicated rows in the dataframe
- **.describe()** to check for suspicious data and outliers
- **.value_counts()** to get counts of unique values in the Series
- **.head()** to view only the first 5 rows of a selected df
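
As an illustration, the calls listed above might look like this on a small toy dataframe. The data is invented for the example; only the column names are borrowed from the archive.

```python
import pandas as pd

# Toy data, not the real archive; column names follow the report.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "name": ["Charlie", "a", "None", "None"],
    "rating_numerator": [12, 13, 420, 420],
    "in_reply_to_status_id": [float("nan"), float("nan"), 99.0, 99.0],
})

df.info()                                  # null counts and datatypes per column
print(df.shape)                            # (number of rows, number of columns)
print(df.head())                           # first 5 rows by default
print(df.duplicated().sum())               # number of fully duplicated rows
print(df["rating_numerator"].describe())   # summary statistics help spot outliers
print(df["name"].value_counts())           # frequency of each distinct name

# Rows that do (or do not) contain a reply ID
replies = df[df["in_reply_to_status_id"].notnull()]
originals = df[df["in_reply_to_status_id"].isnull()]
print(len(replies), len(originals))
```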

After assessing, I found the following problems:

#### Quality Issues
- Missing values in some columns (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls)
- Erroneous datatypes (timestamp, retweeted_status_timestamp)
- Unnecessary rows with replies to other people's tweets
- 181 rows containing retweets (not original ratings)
- Unexpected values in rating numerators and denominators
- Tweet with ID `835152434251116546` rated 0/10 for plagiarism
- Missing data (as None) in most rows of dog stages and dog names
- Erroneous dog names (a, an, the) where a dog name is absent in a tweet
- Inconsistency in breed names (some lowercase, some capitalized)

#### Tidiness Issues
- Rating numerators and denominators should be one variable
- Dog stages (doggo, floofer, pupper, puppo) in 4 columns instead of 1
- Inconsistency in a column name (tweet ID) among tables
- Three dataframes instead of one

After reassessing, I detected 2 more issues:

#### Quality
- Some images were not recognized as dogs (e.g. index 0: orange, bagel, banana)

#### Tidiness
- The 3 predictions should be narrowed to the most likely one

## Data Cleaning

First of all, I created copies of the 3 dataframes.
Then I cleaned the data in the following sequence:


1. Missing data
    - Rows with non-null values in the 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp' columns (i.e. replies and retweets) were dropped, and then these columns were removed completely
    - Rows with null values in 'expanded_urls' were dropped
    - None values in the dog stage and name columns were replaced with NaN
    

2. Tidiness issues
    - The 4 columns doggo, floofer, pupper, puppo were combined into one column (dog stage) using the .ffill() method
    - Column 'id' was renamed to 'tweet_id' in df3
    - All three dataframes were combined into one, using the pandas merge function
    - Rating numerators and denominators were turned into one rating variable (after cleaning the related quality issues, i.e. fixing numerators and denominators)
    
    
3. Quality issues
    - Datatype of 'timestamp' column was converted to datetime
    - Observation with id '835152434251116546' was dropped
    - All erroneous dog names were found and replaced with NaN
    - All dog breed names were converted to lowercase
    - All suspicious rating numerators and denominators were replaced with correct ones where possible
    
    
4. After that, I assessed the dataframe again and cleaned the 1 remaining quality issue and 1 remaining tidiness issue:
    - Rows with False in all three columns 'p1_dog', 'p2_dog', 'p3_dog' were removed, as no dog breed was detected
    - The 3 dog breed predictions were narrowed to the most likely one
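
A few of the cleaning steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's code: the rows are invented, the column names `dog_stage` and `rating` are hypothetical, only one of the reply/retweet columns is shown for brevity, the rating is assumed to be the numerator divided by the denominator, and the stage columns are combined with a row-wise join instead of the .ffill() approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows, not the real archive; column names follow the report.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, np.nan, 55.0, np.nan],
    "name": ["Charlie", "a", "None", "Bella"],
    "doggo": ["doggo", "None", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "pupper", "None", "pupper"],
    "puppo": ["None", "None", "None", "None"],
    "rating_numerator": [12, 13, 11, 10],
    "rating_denominator": [10, 10, 10, 10],
})
clean = df.copy()

# 1. Keep only original tweets, then drop the now-empty column.
clean = clean[clean["retweeted_status_id"].isnull()].copy()
clean = clean.drop(columns=["retweeted_status_id"])

# 2. Treat the string "None" and article-like names as missing values.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
text_cols = stage_cols + ["name"]
clean[text_cols] = clean[text_cols].mask(clean[text_cols] == "None")
clean["name"] = clean["name"].where(~clean["name"].isin(["a", "an", "the"]))

# 3. Collapse the four stage columns into one (hypothetical name: dog_stage).
#    Tweets that mention several stages keep all of them, comma-separated.
clean["dog_stage"] = clean[stage_cols].apply(
    lambda row: ", ".join(row.dropna()) or np.nan, axis=1
)
clean = clean.drop(columns=stage_cols)

# 4. Express the rating as a single variable (assumed: numerator / denominator).
clean["rating"] = clean["rating_numerator"] / clean["rating_denominator"]
print(clean)
```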
    
All coding was followed by testing to make sure that the code was correct.
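
As an example of this code-then-test pattern, the sketch below narrows the predictions and then checks the result with assertions. It rests on assumptions not stated in this report: the predicted labels are taken to live in columns named `p1`, `p2`, `p3`, the predictions are taken to be ordered from most to least confident, and the output column name `breed` is hypothetical. The rows are invented.

```python
import pandas as pd

# Toy prediction rows. p1..p3 (predicted labels) are assumed column names;
# p1_dog..p3_dog are the flags mentioned in the report.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["orange", "golden_retriever", "bagel"],
    "p1_dog": [False, True, False],
    "p2": ["bagel", "kuvasz", "Labrador_retriever"],
    "p2_dog": [False, True, True],
    "p3": ["banana", "pug", "paper_towel"],
    "p3_dog": [False, True, False],
})

# Code: remove rows where none of the three predictions is a dog.
is_dog = df[["p1_dog", "p2_dog", "p3_dog"]]
df = df[is_dog.any(axis=1)].copy()

# Code: keep the highest-ranked prediction that is a dog, in lowercase.
breed = df["p1"].where(df["p1_dog"])
breed = breed.fillna(df["p2"].where(df["p2_dog"]))
breed = breed.fillna(df["p3"].where(df["p3_dog"]))
df["breed"] = breed.str.lower()
df = df[["tweet_id", "breed"]]

# Test: every remaining row has exactly one lowercase breed.
assert not df["breed"].isnull().any()
assert df["breed"].str.islower().all()
assert df["tweet_id"].is_unique
```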

Finally, I stored the cleaned dataset in 'twitter_archive_master.csv'.
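
A minimal sketch of this storing step is shown below. The dataframe contents are invented and the variable name `master` is hypothetical; a temporary directory is used so the example does not overwrite a real file.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final cleaned dataframe.
master = pd.DataFrame({
    "tweet_id": [1, 2],
    "rating": [1.2, 1.3],
    "breed": ["pug", "golden_retriever"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "twitter_archive_master.csv")
    # index=False keeps the row index out of the file, so reading it back
    # does not produce an extra unnamed column.
    master.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```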