# Data Wrangling Report

## Introduction

In this project, I worked through the full data wrangling process (gather, assess, clean) on the tweet archive of the Twitter user @dog_rates, also known as WeRateDogs, an account that rates people's dogs and adds a humorous comment about each one. The aim is to wrangle this Twitter data well enough to support interesting and trustworthy analyses and visualizations.

## Gathering Data

For this project, I had to gather each of the following pieces of data:

- **File on Hand:** WeRateDogs Twitter archive. This file was given to me, so I simply downloaded it manually by clicking the link (twitter_archive_enhanced.csv) in my Udacity classroom.


- **File Downloaded Programmatically:** The tweet image predictions file (image_predictions.tsv) is hosted on Udacity's servers, and I downloaded it programmatically using the Requests library and the file link.


- **JSON Data Queried using Twitter API:** Using the tweet IDs in the WeRateDogs Twitter archive, I queried the Twitter API for each tweet's JSON data with Python's Tweepy library and stored each tweet's entire set of JSON data in a file called tweet_json.txt, one tweet per line. I then read this file line by line into a pandas DataFrame with tweet ID, retweet count, and favorite count.
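
The line-by-line read described in the last item can be sketched as follows. This is a minimal illustration, not the exact code used in the project: the function name `read_tweet_json` is hypothetical, and the JSON keys `id`, `retweet_count`, and `favorite_count` are assumed to be the field names present in each stored tweet object.

```python
import json


def read_tweet_json(path):
    """Read one JSON object per line and keep only three fields per tweet.

    Assumes each non-blank line is a complete tweet object containing
    the keys 'id', 'retweet_count', and 'favorite_count'.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            tweet = json.loads(line)
            records.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return records
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(...)` to obtain the three-column table.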

## Assessing Data

I assessed the gathered data visually and programmatically with the help of the following pandas functions:
- `.info()`
- `.sample()`
- `.unique()`
- `.duplicated()`
- `.notna()`
- `.isna()`
- `.describe()`
- `.value_counts()`
- `.isin()`
- `.str.startswith('RT ')`: to find the records that are retweets.
- `.str.contains('(\d+\.\d*\/\d+)')`: to check for tweets whose ratings contain a decimal point.
- `.str.contains('\d+\/10(\S*\D*\d+\/10)+')`: to check for tweets containing multiple ratings.
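
To illustrate what the two rating patterns above match, here is a small sketch using Python's `re` module in place of pandas. `re.search` finds a match anywhere in the string, which is also how `.str.contains` treats a regex. The helper names and the sample tweet texts are made up for illustration.

```python
import re

# Same patterns as in the assessment list, written as raw strings.
DECIMAL_RATING = re.compile(r"(\d+\.\d*\/\d+)")
MULTIPLE_RATINGS = re.compile(r"\d+\/10(\S*\D*\d+\/10)+")


def has_decimal_rating(text):
    """True if the text contains a rating whose numerator has a decimal point."""
    return DECIMAL_RATING.search(text) is not None


def has_multiple_ratings(text):
    """True if the text contains at least two ratings out of 10."""
    return MULTIPLE_RATINGS.search(text) is not None


# Hypothetical tweet texts, not taken from the archive.
samples = [
    "This is Bella. 13.5/10 would pet",
    "Solid 12/10 pupper",
    "10/10 for the pup and 7/10 for the cat",
]
for text in samples:
    print(text, has_decimal_rating(text), has_multiple_ratings(text))
```

Both patterns contain capture groups, so pandas may warn about match groups when they are passed to `.str.contains`; the result is still a boolean mask.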

###### While assessing the data, I identified the following quality and tidiness issues:

### Quality Issues:

##### `Twitter Archive` Table

1. Erroneous datatypes (timestamp, in_reply_to_status_id, in_reply_to_user_id, dog stages).
2. Unneeded columns (retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp).
3. Existence of records with retweets.
4. The source column contains HTML anchor tags, which make its content hard to read.
5. Tweets with no images (missing expanded_urls), most of which are replies to other tweets.
6. Tweets with duplicated images (expanded_urls), most of which appear to be retweets.
7. Incorrectly extracted ratings.
8. Nulls represented as the string "None" in the name and dog stage columns.

##### `Image Predictions` Table

9. Inconsistent capitalization in the p1, p2, and p3 columns, and a mix of underscores and dashes between words.
10. Tweets with duplicated images (jpg_url), which might be retweets.
11. Existence of records with retweets.
12. Non-descriptive column names.

##### `JSON Data` Table

13. Existence of records with retweets.

### Tidiness Issues:

1. One variable in multiple columns for dog stages in `Twitter Archive` ('doggo', 'floofer', 'pupper', 'puppo').
2. Timestamp column in `Twitter Archive` contains two variables: date and time.
3. Image predictions columns and retweet and favorite counts should be part of `Twitter Archive` table.

## Cleaning Data

Before cleaning, I made a copy of each piece of data so that every cleaning operation was performed on the copies rather than on the originals.

I structured the cleaning as a series of **Define**, **Code**, and **Test** steps, one set for each data quality and tidiness issue. I tackled the tidiness issues first, then the quality issues.

The cleaning process relied on several functions, including:

- `.str.split()`: to split some columns into multiple columns (e.g., the timestamp, and the ratings after extraction).
- `.str.extract()`: to extract the dog stage after combining the four dog stage columns.
- `pd.merge()`: to merge the `Image Predictions` and `JSON Data` tables into the `Twitter Archive` table.
- `.astype()`: to fix the erroneous data types.
- `.drop()`: to drop unneeded rows and columns.
- `bs().a.contents` from the BeautifulSoup parser: to extract the text inside the HTML anchor tags in the source column.
- `.replace()`: to replace the string "None" in the name column with a null value.
- `.str.findall('(\d+\.*\d*\/\d+)')`: to re-extract the ratings from the text column.
- `re.search()`: to check whether a tweet's text matches a given regex.
- `.str.lower()`: to convert the text in p1, p2, and p3 to lowercase.
- `.rename()`: to rename the columns of the `Image Predictions` table.
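
As a rough sketch of how a few of these calls fit together, the example below collapses the four dog stage columns into one and then merges in the retweet and favorite counts. The tiny DataFrames are invented for illustration, and the column name `dog_stage` is an assumption rather than the name necessarily used in the project.

```python
import pandas as pd

# Toy stand-ins for the copied tables; the values are made up.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 7, 9],
    "favorite_count": [10, 20, 30],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Blank out the "None" placeholders, then concatenate the four columns.
stages = archive[stage_cols].replace("None", "")
combined = stages["doggo"] + stages["floofer"] + stages["pupper"] + stages["puppo"]

# Pull the first stage name out of the combined string; rows with no
# stage become missing values.
archive["dog_stage"] = combined.str.extract(
    r"(doggo|floofer|pupper|puppo)", expand=False
).astype("category")
archive = archive.drop(columns=stage_cols)

# Attach the counts so each tweet is described by a single row.
master = pd.merge(archive, counts, on="tweet_id", how="inner")
print(master)
```

Note that this sketch keeps only the first stage found, so a tweet tagged with two stages would need separate handling.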

Cleaning up the incorrectly extracted ratings was the hardest part because:
- some ratings use a denominator other than 10.
- some ratings contain a decimal part.
- some extracted values are not ratings at all.
- some tweets contain multiple rating values.

Solving this issue therefore took several steps: re-extracting the ratings from the text column, then fixing each of these sub-issues separately.
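
The re-extraction step can be sketched with the same pattern that was passed to `.str.findall`, here applied with Python's `re` module to a single string. The helper names `extract_ratings` and `needs_manual_review` are hypothetical, and the review rule (exactly one rating, out of 10) is only one possible way to flag the sub-issues listed above.

```python
import re

# Same pattern as used with .str.findall in the cleaning step.
RATING_PATTERN = re.compile(r"(\d+\.*\d*\/\d+)")


def extract_ratings(text):
    """Return every (numerator, denominator) pair found in a tweet's text."""
    pairs = []
    for candidate in RATING_PATTERN.findall(text):
        numerator, denominator = candidate.split("/")
        try:
            pairs.append((float(numerator), int(denominator)))
        except ValueError:
            continue  # e.g. a numerator with repeated dots
    return pairs


def needs_manual_review(pairs):
    """Flag tweets without exactly one rating, or with a denominator other than 10."""
    return len(pairs) != 1 or pairs[0][1] != 10
```

For example, a text such as "Open 24/7 and still 11/10" yields two pairs and is flagged, since the first value is not a rating at all.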

The remaining issues were comparatively straightforward to clean.

## Storing Data

Finally, the cleaned DataFrame was stored in a CSV file named twitter_archive_master.csv to be used later in the analysis and visualization.
