## Reporting: wrangle_report

# Introduction 

Data wrangling operations were carried out for WeRateDogs tweets.

The following wrangling operations were performed:
- Data Gathering 
- Assessing Data
- Data Cleaning


# Data Gathering 
### step 1:
The tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs), was manually downloaded from this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv).

### step 2:
Using the Python `requests` library, an additional dataset, the tweet image predictions, was downloaded from this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).
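
A minimal sketch of such a download is given here. The URL is the one linked above; the function name `download_file`, the timeout and the output file name are illustrative assumptions, not necessarily what was used in the notebook.

```python
import requests

# URL taken from this report; everything else below is illustrative.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, path):
    """Download `url` and write the raw response bytes to `path`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors (4xx/5xx)
    with open(path, "wb") as handle:
        handle.write(response.content)
    return path
```

Calling `download_file(IMAGE_PREDICTIONS_URL, "image-predictions.tsv")` would save the file in the working directory, after which it could be loaded with `pd.read_csv(..., sep="\t")`.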

### step 3:
Furthermore, additional records for the archive data were collected through Twitter API calls: the retweet count and the favorite count of each tweet in the archive.
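
The API calls themselves are not reproduced in this report. As a rough sketch, assuming each API response was saved as one JSON object per line and carries the `id`, `retweet_count` and `favorite_count` fields, the two counts could be collected into a table as follows. The function name and the sample records are hypothetical.

```python
import json

import pandas as pd


def tweet_stats_from_json_lines(lines):
    """Build a table of tweet_id, retweet_count and favorite_count.

    Each element of `lines` is assumed to be one tweet object serialised
    as JSON with the fields `id`, `retweet_count` and `favorite_count`.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return pd.DataFrame(
        records, columns=["tweet_id", "retweet_count", "favorite_count"]
    )


# Two made-up tweet objects standing in for saved API responses.
sample_lines = [
    '{"id": 1001, "retweet_count": 12, "favorite_count": 40}',
    '{"id": 1002, "retweet_count": 3, "favorite_count": 9}',
]
tweet_stat = tweet_stats_from_json_lines(sample_lines)
print(tweet_stat)
```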

# Assessing Data
Visual and programmatic assessments were carried out on the gathered datasets.

### Visual assessment:
For the visual assessment, each of the three gathered datasets was displayed in a Jupyter notebook, using the `sample` function to draw 10 random rows and inspect them for possible errors.
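
For illustration, drawing 10 random rows might look like the sketch below. The small table is made up and only stands in for one of the gathered datasets; `random_state` is fixed here purely to make the draw repeatable.

```python
import pandas as pd

# Small made-up table standing in for one of the gathered datasets.
twitter_archive = pd.DataFrame(
    {
        "tweet_id": range(1, 21),
        "name": ["Bella", "a", "None", "Charlie"] * 5,
    }
)

# Draw 10 random rows for visual inspection.
preview = twitter_archive.sample(10, random_state=0)
print(preview)
```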

### Programmatic assessment:
For the programmatic assessment, pandas methods such as `df.info()`, `unique()` (on individual columns) and `value_counts()` were used to check the gathered data for possible errors.
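
The sketch below shows what these checks could look like on a small made-up table. The column values are invented for the example, and `unique()` is applied to a single column because it is a Series method.

```python
import pandas as pd

# Made-up rows; only the column names mirror the archive data.
twitter_archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "name": ["Bella", "a", "None", "a"],
        "rating_denominator": [10, 10, 10, 7],
    }
)

# Column dtypes and non-null counts.
twitter_archive.info()

# Distinct values of one column, in order of first appearance.
distinct_names = twitter_archive["name"].unique()
print(distinct_names)

# How often each value occurs, most frequent first.
name_counts = twitter_archive["name"].value_counts()
print(name_counts)
```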

### assessment findings:
From the visual and programmatic assessments, the following data quality and data tidiness issues were discovered:
- incorrect records.
- missing records.
- wrong datatype format.
- insufficiently descriptive column names.
- multiple pieces of information contained in a single column.

>> Quality issues
1. Invalid entries in the name column (none, such, a, not, just, my, all, old, the, by).
2. Erroneous datatypes (timestamp, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp). ID fields such as tweet_id and in_reply_to_status_id should be objects, not integers or floats, because they are identifiers and are not intended for calculations.
3. Retweets need to be removed, as they may otherwise skew the results of the analysis.
4. Missing records in in_reply_to_status_id, expanded_urls, in_reply_to_user_id, retweeted_status_id and retweeted_status_timestamp.
5. None values in the 'doggo', 'floofer', 'pupper' and 'puppo' columns.
6. The source column in the `twitter_archive` table combines the source link and the source title; these should be separate columns.
7. Column names in the `image_pred` table are not descriptive enough.
8. Ratings: the rating_numerator and rating_denominator columns should be floats, and some ratings were not extracted correctly from the text.
9. p1_conf, p2_conf and p3_conf records should be expressed as percentages.
10. Prediction names contain underscore characters.
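
Two of these issues lend themselves to a quick programmatic check. The sketch below, on made-up rows, flags all-lowercase entries in the name column and tweets whose text contains a decimal rating; the regular expression is an assumption about how such ratings are written (for example `13.5/10`), not the exact pattern used in the notebook.

```python
import pandas as pd

# Made-up rows; only the column names mirror the archive data.
twitter_archive = pd.DataFrame(
    {
        "name": ["Bella", "a", "the", "None", "Charlie"],
        "text": [
            "This is Bella. 13.5/10 would pet",
            "This is a dog. 12/10",
            "Here is the goodest boy 9.75/10",
            "No name here 11/10",
            "Meet Charlie. 14/10",
        ],
    }
)

# All-lowercase entries are likely ordinary words rather than dog names.
lowercase_names = twitter_archive.loc[
    twitter_archive["name"].str.islower(), "name"
]
print(lowercase_names.tolist())

# Ratings with a decimal numerator are easy to extract incorrectly.
has_decimal_rating = twitter_archive["text"].str.contains(r"\d+\.\d+/\d+")
print(twitter_archive.loc[has_decimal_rating, "text"].tolist())
```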

>> Tidiness issues
1. The text column in the `twitter_archive` table should be split into about and link, with the rating removed.
2. The `tweet_stat` table should be merged into the `twitter_archive` table so that the records are unified for analysis.
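
As an illustration of the second tidiness issue, the merge could be sketched as below. The rows are made up, and `how="left"` is an assumption chosen so that archive tweets without API data are kept; the notebook may have used a different join type.

```python
import pandas as pd

# Made-up rows; tweet_id is stored as a string in both tables.
twitter_archive = pd.DataFrame(
    {"tweet_id": ["1001", "1002", "1003"], "name": ["Bella", "Rex", "Charlie"]}
)
tweet_stat = pd.DataFrame(
    {
        "tweet_id": ["1001", "1002"],
        "retweet_count": [12, 3],
        "favorite_count": [40, 9],
    }
)

# Join the counts onto the archive; unmatched tweets get missing values.
merged = twitter_archive.merge(tweet_stat, on="tweet_id", how="left")
print(merged)
```

Both tables need the same datatype for tweet_id before merging, otherwise pandas may refuse to join or fail to match rows.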

# Clean Data
To address the data quality and tidiness issues identified, the following steps were carried out to clean the dataset:
- Dropped all rows with a dog name starting with a lowercase letter, using the islower string function.
- Dropped all rows where retweeted_status_id, retweeted_status_user_id or retweeted_status_timestamp is not null.
- Converted timestamp and retweeted_status_timestamp from string format to datetime.
- Converted tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id to object (string) format.
- Dropped the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_timestamp columns using the pandas drop function.
- Dropped the 'doggo', 'floofer', 'pupper' and 'puppo' columns, as their values are mostly None and they are not needed for the analysis questions.
- Multiplied p1_conf, p2_conf and p3_conf by 100 to convert them to percentages.
- Removed "_" characters from the prediction names using the replace function.
- Extracted the rating values from the text column.
- Extracted the link from the text column in the twitter_archive table using regular expressions and pandas' str.extract method, then renamed the text column.
- Split source into source URL and source title using Beautiful Soup, then dropped the source column.
- Added descriptive column names in the image_pred table:
  - Changed p1 to prediction_1, p2 to prediction_2, and p3 to prediction_3.
  - Changed p1_conf to prediction1_confidence(%), p2_conf to prediction2_confidence(%), and p3_conf to prediction3_confidence(%).
  - Changed p1_dog to prediction1_dog_type, p2_dog to prediction2_dog_type, and p3_dog to prediction3_dog_type.
- Merged the retweet_count and favorite_count columns into the twitter archive table, joining on tweet_id.
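
Several of the steps above can be sketched in pandas as follows. The rows are made up, only the first prediction (p1) is shown, the rating pattern is an assumed regular expression, and underscores are replaced with a space here, which is one possible reading of "removed".

```python
import pandas as pd

# Made-up rows; only the column names mirror the gathered tables.
twitter_archive = pd.DataFrame(
    {
        "tweet_id": [1001, 1002, 1003, 1004],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
            "2017-07-29 16:00:24 +0000",
        ],
        "retweeted_status_id": [None, None, 9.99e17, None],
        "name": ["Bella", "a", "Rex", "Charlie"],
        "text": [
            "This is Bella. 13.5/10 would pet",
            "Here is a very good dog 12/10",
            "RT: This is Rex 11/10",
            "Meet Charlie. 14/10",
        ],
    }
)
image_pred = pd.DataFrame(
    {
        "tweet_id": [1001, 1004],
        "p1": ["golden_retriever", "Labrador_retriever"],
        "p1_conf": [0.95, 0.5],
        "p1_dog": [True, True],
    }
)

# Drop rows whose name is all lowercase (ordinary words, not names).
archive_clean = twitter_archive[~twitter_archive["name"].str.islower()]

# Keep only original tweets: rows without a retweeted_status_id.
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isnull()].copy()

# Parse timestamps and store tweet IDs as strings.
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)

# Re-extract ratings from the text, allowing decimal values (assumed pattern).
ratings = archive_clean["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")
archive_clean["rating_numerator"] = ratings[0].astype(float)
archive_clean["rating_denominator"] = ratings[1].astype(float)

# Image predictions: tidy names, convert confidence to %, rename columns.
pred_clean = image_pred.copy()
pred_clean["tweet_id"] = pred_clean["tweet_id"].astype(str)
pred_clean["p1"] = pred_clean["p1"].str.replace("_", " ", regex=False)
pred_clean["p1_conf"] = pred_clean["p1_conf"] * 100
pred_clean = pred_clean.rename(
    columns={
        "p1": "prediction_1",
        "p1_conf": "prediction1_confidence(%)",
        "p1_dog": "prediction1_dog_type",
    }
)

print(archive_clean)
print(pred_clean)
```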

# Storing and Analysis
The cleaned data was stored as CSV files, and exploratory analysis was performed on the cleaned dataset to draw insights.
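
A small sketch of the storing step is shown below. It writes to an in-memory buffer so that it runs anywhere; in practice a file path would be passed to `to_csv` instead. The rows are made up, and reading tweet_id back as a string is an assumption that keeps the IDs from being parsed as integers.

```python
import io

import pandas as pd

# Made-up cleaned rows.
cleaned = pd.DataFrame(
    {"tweet_id": ["1001", "1002"], "rating_numerator": [13.5, 12.0]}
)

# index=False keeps the row index out of the CSV output.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the data back, keeping tweet_id as a string.
restored = pd.read_csv(io.StringIO(csv_text), dtype={"tweet_id": str})
print(restored)
```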
