# Wrangle and Analyze Data Project (The Tweet Archive of Twitter User 'WeRateDogs')

## Wrangling Steps

by Hassan Moharram

Date: February 13, 2019

### Gather Data

We have three files to gather:   
1- twitter-archive-enhanced.csv   
2- tweet_json.txt    
3- image-predictions.tsv    

- The first file (twitter-archive-enhanced.csv):   
This file is manually downloaded.

- The second file (tweet_json.txt):    
This file holds, at minimum, each tweet's retweet count and favorite ("like") count, plus any additional data of interest. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data with Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read tweet_json.txt line by line into a pandas DataFrame with the columns tweet_id, favorites, retweets, user_followers, user_favourites, and date_time.


- The third file (image-predictions.tsv):   
This file is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
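
The line-by-line read of the second file can be sketched as follows. This is a minimal illustration, not the exact code used in the project: the helper name `read_tweet_json` is hypothetical, the JSON keys (`id`, `favorite_count`, `retweet_count`, `user.followers_count`, `user.favourites_count`, `created_at`) are assumed to follow the standard tweet object returned by the Twitter API, and the two sample tweets are made up.

```python
import json

import pandas as pd

COLUMNS = ["tweet_id", "favorites", "retweets",
           "user_followers", "user_favourites", "date_time"]


def read_tweet_json(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "favorites": tweet["favorite_count"],
            "retweets": tweet["retweet_count"],
            "user_followers": tweet["user"]["followers_count"],
            "user_favourites": tweet["user"]["favourites_count"],
            "date_time": tweet["created_at"],
        })
    return pd.DataFrame(rows, columns=COLUMNS)


# Two made-up tweets stand in for the contents of tweet_json.txt
sample_lines = [
    json.dumps({"id": 1, "favorite_count": 10, "retweet_count": 2,
                "user": {"followers_count": 100, "favourites_count": 5},
                "created_at": "Wed Feb 13 00:00:00 +0000 2019"}),
    json.dumps({"id": 2, "favorite_count": 30, "retweet_count": 4,
                "user": {"followers_count": 100, "favourites_count": 5},
                "created_at": "Wed Feb 13 01:00:00 +0000 2019"}),
]
tweet_json = read_tweet_json(sample_lines)
```

With the real file, the same helper could be called as `with open("tweet_json.txt") as f: tweet_json = read_tweet_json(f)`, since a file object also yields one line at a time.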

### Assess Data

### Quality

#### `twitter_archive_enhanced` table:
- There are some non-null values in the 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp' columns.
- There is no need for columns like 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp'.
- The 'name' column should be renamed to something more descriptive.
- Erroneous data types (in_reply_to_status_id, in_reply_to_user_id, timestamp, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and source).
- Missing values in (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls).
- Some values in 'text' start with 'RT @dog_rates: ' instead of the main text content.
- Null values in 'doggo', 'floofer', 'pupper', and 'puppo' are represented as None instead of NaN.
- Null values in 'name' are represented as None instead of NaN.
- Inconsistent capitalization (lowercase and uppercase) in 'name'.
- Invalid values in 'name', such as None, a, an, the, his, and my.

#### `tweet_json` table:
- The total number of tweet_id values is 2340 instead of 2356.

#### `image_predictions` table:
- p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, and p3_dog should be condensed into two columns: 'dog_breed' and 'confidence_level'.
- The total number of tweet_id values is 2075 instead of 2356.
- There are 66 duplicated 'jpg_url' values.
- Inconsistent capitalization (lowercase and uppercase) in 'p1', 'p2', and 'p3'.
- Erroneous data types ('p1', 'p2', and 'p3').

- 'timestamp' and 'date_time' in the `twitter_archive_enhanced` and `tweet_json` tables, respectively, hold the same values under different names.

### Tidiness

- `twitter_archive_enhanced` table: the 'doggo', 'floofer', 'pupper', and 'puppo' columns should be combined into one column, 'dogtionary'.
- All three tables should be combined into one table.

### Clean Data

#### Quality

##### There are some non-null values in the 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp' columns.

##### Define
Drop the rows that have non-null values in those columns.
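
A minimal sketch of this step, using a made-up three-row sample in place of the real archive. The table name `twitter_archive_enhanced_clean` follows the naming used later in this report; keeping only rows where all three retweet columns are null is one reasonable reading of the step, not necessarily the exact code used.

```python
import numpy as np
import pandas as pd

# Made-up sample: the second row is a retweet
twitter_archive_enhanced_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 111.0, np.nan],
    "retweeted_status_user_id": [np.nan, 222.0, np.nan],
    "retweeted_status_timestamp": [None, "2017-01-01 00:00:00 +0000", None],
})

retweet_cols = ["retweeted_status_id",
                "retweeted_status_user_id",
                "retweeted_status_timestamp"]

# Keep only rows where every retweet column is null (original tweets)
is_original = twitter_archive_enhanced_clean[retweet_cols].isnull().all(axis=1)
twitter_archive_enhanced_clean = (
    twitter_archive_enhanced_clean[is_original].reset_index(drop=True)
)
```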

#### Quality

##### `twitter_archive_enhanced` table: there is no need for columns like 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp'.

##### Define
Drop the 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp' columns.
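
The drop can be sketched with `DataFrame.drop`. The empty sample table below is made up and carries only a subset of the archive's columns.

```python
import pandas as pd

# Made-up, empty stand-in with a subset of the archive's columns
twitter_archive_enhanced_clean = pd.DataFrame(columns=[
    "tweet_id", "in_reply_to_status_id", "in_reply_to_user_id",
    "timestamp", "text", "retweeted_status_id",
    "retweeted_status_user_id", "retweeted_status_timestamp",
])

unneeded_cols = ["in_reply_to_status_id", "in_reply_to_user_id",
                 "retweeted_status_id", "retweeted_status_user_id",
                 "retweeted_status_timestamp"]

# Drop the reply and retweet columns in one call
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.drop(
    columns=unneeded_cols
)
```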

#### Tidiness

###### `twitter_archive_enhanced` table: the 'doggo', 'floofer', 'pupper', and 'puppo' columns should be combined into one column, 'dogtionary'.

#### Define
Replace 'None' values in the 'doggo', 'floofer', 'pupper', and 'puppo' columns with ''. Then sum up (concatenate) those four columns into one column called 'dogtionary' and drop the original four. Finally, replace '' values in 'dogtionary' with NaN.
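
A minimal sketch of the step on a made-up sample. Joining the columns with `+` is one way to combine them, and the final replacement of '' with NaN is an assumption based on the None-versus-NaN issue listed in the assessment.

```python
import numpy as np
import pandas as pd

# Made-up sample: row 3 has no dog stage at all
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# 'None' -> '' so that string concatenation leaves only the real stage
archive[stage_cols] = archive[stage_cols].replace("None", "")
archive["dogtionary"] = (archive["doggo"] + archive["floofer"]
                         + archive["pupper"] + archive["puppo"])

# Drop the four source columns, then mark empty stages as missing
archive = archive.drop(columns=stage_cols)
archive["dogtionary"] = archive["dogtionary"].replace("", np.nan)
```

Note that a tweet tagged with two stages would end up with a joined value such as 'doggopupper', which may need separate handling.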

#### Quality

##### `image_predictions` table: p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_conf, and p3_dog should be condensed into two columns: 'dog_breed' and 'confidence_level'.

#### Define
Use a for loop over each row of the `image_predictions_clean` table to build two lists, dog_breed and confidence_level. Add them to the `image_predictions_clean` table as two columns. Then drop the 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', and 'p3_dog' columns.
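
The loop could look like the sketch below. The selection rule is an assumption, since it is not spelled out above: for each row, take the first prediction (p1, then p2, then p3) flagged as a dog, and record NaN when none of the three is a dog. The sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample: row 2 has a non-dog top prediction, row 3 has no dog at all
image_predictions_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_conf": [0.9, 0.6, 0.5],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "pug", "bagel"],
    "p2_conf": [0.05, 0.3, 0.2],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "Chihuahua", "banana"],
    "p3_conf": [0.01, 0.05, 0.1],
    "p3_dog": [True, True, False],
})

dog_breed = []
confidence_level = []
for _, row in image_predictions_clean.iterrows():
    for n in (1, 2, 3):
        if row[f"p{n}_dog"]:  # assumed rule: first prediction that is a dog
            dog_breed.append(row[f"p{n}"])
            confidence_level.append(row[f"p{n}_conf"])
            break
    else:  # no prediction was a dog
        dog_breed.append(np.nan)
        confidence_level.append(np.nan)

image_predictions_clean["dog_breed"] = dog_breed
image_predictions_clean["confidence_level"] = confidence_level

# Drop the nine original prediction columns
prediction_cols = [f"p{n}{suffix}" for n in (1, 2, 3)
                   for suffix in ("", "_conf", "_dog")]
image_predictions_clean = image_predictions_clean.drop(columns=prediction_cols)
```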

###### All three tables should be combined into one table.

#### Define
Merge the `twitter_archive_enhanced_clean` table with the `tweet_json_clean` table, joining on tweet_id. Then merge the resulting `twitter_archive_enhanced_clean` table with the `image_predictions_clean` table, joining on tweet_id.
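
A minimal sketch of the two merges on made-up samples. The join type is an assumption: an inner join (the `pd.merge` default) keeps only tweets present in all three tables.

```python
import pandas as pd

# Made-up samples with partially overlapping tweet IDs
twitter_archive_enhanced_clean = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
tweet_json_clean = pd.DataFrame(
    {"tweet_id": [1, 2], "retweets": [5, 7]})
image_predictions_clean = pd.DataFrame(
    {"tweet_id": [2, 3], "dog_breed": ["pug", "kuvasz"]})

# First merge: archive + API data
twitter_archive_enhanced_clean = pd.merge(
    twitter_archive_enhanced_clean, tweet_json_clean,
    on="tweet_id", how="inner")

# Second merge: result + image predictions
twitter_archive_enhanced_clean = pd.merge(
    twitter_archive_enhanced_clean, image_predictions_clean,
    on="tweet_id", how="inner")
```

In this sample only tweet 2 appears in all three tables, so the merged table has a single row.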

#### Quality

##### 'timestamp' and 'date_time' in the `twitter_archive_enhanced` and `tweet_json` tables, respectively, hold the same values under different names.

#### Define
Drop 'date_time' column.

#### Quality

##### The 'name' column should be renamed to something more descriptive.

#### Define
Rename the 'name' column to 'dog_name'.

#### Quality

##### Erroneous data types ('source', 'dogtionary', 'dog_breed', and 'timestamp')

#### Define
Convert 'source', 'dogtionary', and 'dog_breed' to categorical data type and 'timestamp' to datetime data type.
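
The conversions can be sketched with `astype("category")` and `pd.to_datetime`. The two sample rows are made up, with timestamps written in a year-month-day format with a UTC offset.

```python
import pandas as pd

# Made-up sample rows
merged = pd.DataFrame({
    "source": ["Twitter for iPhone", "Twitter Web Client"],
    "dogtionary": ["pupper", None],
    "dog_breed": ["pug", "kuvasz"],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})

# Low-cardinality text columns become categoricals
for col in ["source", "dogtionary", "dog_breed"]:
    merged[col] = merged[col].astype("category")

# Timestamp strings become timezone-aware datetimes
merged["timestamp"] = pd.to_datetime(merged["timestamp"])
```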

##### Erroneous data types (in_reply_to_status_id, in_reply_to_user_id, timestamp, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and source)

#### Define
I already dropped the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp columns. The timestamp and source columns were already converted to the right data types in the previous step.

##### Too many missing values in (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp)

#### Define
I already dropped these columns previously, so there is no need for them now.

##### Some values in 'text' start with 'RT @dog_rates: ' instead of the main text content.

#### Define
Check whether each text starts with 'RT @dog_rates: ' and, if so, replace that prefix with ''. Append every resulting text to a list called text_modified. Then drop the text column and add the text_modified list as a column to twitter_archive_enhanced_clean.
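
A minimal sketch of this step on a made-up sample. Re-adding the list under the column name 'text' is an assumption, and slicing off the prefix is used here so that only a leading 'RT @dog_rates: ' is removed.

```python
import pandas as pd

PREFIX = "RT @dog_rates: "

# Made-up sample: the first tweet carries the retweet prefix
twitter_archive_enhanced_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "text": ["RT @dog_rates: This is Bo. 12/10", "This is Max. 11/10"],
})

text_modified = []
for text in twitter_archive_enhanced_clean["text"]:
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]  # strip only the leading prefix
    text_modified.append(text)

# Swap the old column for the cleaned list
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.drop(columns="text")
twitter_archive_enhanced_clean["text"] = text_modified
```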

##### Null values in 'doggo', 'floofer', 'pupper', and 'puppo' are represented in None instead of NaN.

#### Define
I already solved this problem when combining these columns into 'dogtionary'.

##### Null values in 'name' are represented in None instead of NaN.

#### Define
Replace all 'None' values in the dog_name column with NaN.

##### Lowercase and uppercase given 'name'.

#### Define
Make all names in dog_name column lowercase.
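
The two dog_name steps above (replacing 'None' with NaN and lowercasing) can be sketched as follows on a made-up sample.

```python
import numpy as np
import pandas as pd

# Made-up sample names
names = pd.DataFrame({"dog_name": ["Phineas", "None", "an", "Tilly"]})

# The string 'None' becomes a real missing value
names["dog_name"] = names["dog_name"].replace("None", np.nan)

# Lowercase every remaining name; missing values stay missing
names["dog_name"] = names["dog_name"].str.lower()
```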

##### Invalid values in 'name', such as None, a, an, the, his, and my.

#### Define
Remove all the invalid names and replace them with the right ones.

##### The total number of tweet_id values is 2340 instead of 2356 in the `tweet_json` table

#### Define
This was already solved earlier by merging the `tweet_json` table with the `twitter_archive_enhanced_clean` table.

##### The total number of tweet_id values is 2075 instead of 2356 in the `image_predictions` table

#### Define
This was already solved earlier by merging `image_predictions` table with `twitter_archive_enhanced_clean` table.

##### Inconsistent capitalization (lowercase and uppercase) in 'p1', 'p2', and 'p3'.

#### Define
This was already solved earlier.

##### Erroneous data types ('p1', 'p2', and 'p3')

#### Define
This was already solved earlier.