## Reporting: wrangle_report

An overview of the data wrangling steps carried out in this project.

### Part One: Importing Data

Three datasets had to be imported.

1. The image predictions file was a tsv file and was imported using pd.read_csv.
2. The twitter_archive_enhanced file was provided by Udacity and imported and unzipped using Python libraries.
3. The additional Twitter data, tweet_json, was read with the json Python library and converted into a pandas dataframe by writing each JSON line into the dataframe one at a time. The data was provided by Udacity because I could not obtain access to a Twitter developer account.
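The two imports that involve format handling can be sketched roughly as follows. This is only an illustration: the in-memory strings stand in for the real files, their contents are made up, and the sketch collects the parsed JSON lines in a list before building the dataframe once, which is a common alternative to writing rows into the dataframe one at a time.

```python
import io
import json

import pandas as pd

# Tiny in-memory stand-ins for the real files (contents invented for illustration).
tsv_text = "tweet_id\tjpg_url\tp1\n1\thttp://example.com/a.jpg\tgolden_retriever\n"
json_lines_text = (
    '{"id": 1, "retweet_count": 5, "favorite_count": 12}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)

# A TSV file can be read with pd.read_csv by setting the separator to a tab.
image_predictions = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Parse each JSON line into a dict, then build the dataframe from the list of dicts.
records = [json.loads(line) for line in io.StringIO(json_lines_text) if line.strip()]
tweet_json = pd.DataFrame(records)

print(image_predictions.shape)
print(tweet_json.columns.tolist())
```

With real data, the `io.StringIO(...)` objects would be replaced by file paths or open file handles.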

### Part Two: Identifying Issues

Issues were identified in the last two files, twitter_archive_enhanced and tweet_json.

The two files first had to be merged into one so that retweet count and favourite count in particular, among other indicators, could be analyzed.

Once the relevant columns had been renamed to share the same name, the files were merged on the 'tweet_id' column using pd.merge(). The merged file, minus the irrelevant columns (more details below), was named "twitter_relevant_data".
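A minimal sketch of the rename-then-merge step is shown here. The toy rows are invented, and `how="inner"` (the pd.merge default) is an assumption, since the report does not state which join type was used.

```python
import pandas as pd

# Toy stand-ins for the two Twitter datasets (values invented for illustration).
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
tweet_json = pd.DataFrame(
    {"id": [1, 2], "retweet_count": [5, 0], "favorite_count": [12, 3]}
)

# Give the key column the same name in both frames before merging.
tweet_json = tweet_json.rename(columns={"id": "tweet_id"})

# Assumption: an inner join, which keeps only tweets present in both frames.
merged = pd.merge(archive, tweet_json, on="tweet_id", how="inner")
print(merged)
```

An inner join silently drops tweets that appear in only one file, so comparing row counts before and after the merge is a cheap sanity check.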

The issues are listed below, each followed by the steps taken to address it:

1. Retweeted images are present in twitter_archive_enhanced.

*Steps: the retweeted tweets were dropped using pandas.DataFrame.drop().
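One possible way to drop retweets with DataFrame.drop() is sketched below. It assumes that a retweet is any row with a non-null 'retweeted_status_id'; the report does not say which column was used to identify them.

```python
import numpy as np
import pandas as pd

# Toy data: the second row is a retweet (values invented for illustration).
df = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweeted_status_id": [np.nan, 111.0, np.nan]}
)

# Assumption: a non-null retweeted_status_id marks a retweet.
retweet_rows = df[df["retweeted_status_id"].notna()].index

# Drop those rows by index label.
df = df.drop(index=retweet_rows)
print(df)
```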

2. Non-descriptive headers in twitter_archive_enhanced.

3. Non-descriptive headers in tweet_json.
    
*Steps: Both the above were cleaned after the two files were merged and irrelevant columns dropped. The only relevant non-descriptive header left was 'lang', which was renamed to 'language'. This column showed the language in which each tweet was written.

4. The relevant variables favourite count, retweet count and tweet id do not have the same names in tweet_json and twitter_archive_enhanced.

*Steps: These were renamed using pd.DataFrame.rename().

5. Timestamp column in twitter_archive_enhanced is unparsed.

*Step: the 'created_at' column was dropped so only the 'timestamp' column was used. Then pd.to_datetime was used to convert all timestamps to the pandas datetime format. This was checked using DataFrame.info().

6. Created_at column in tweet_json is unparsed.

*Step: Same as above.

7. Column names timestamp and created_at do not match.

*Step: As above, 'created_at' was dropped as it is identical to 'timestamp'.
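The timestamp handling for issues 5 to 7 might look roughly like this. The two sample rows are invented and only mimic the general shape of the two columns.

```python
import pandas as pd

# Toy rows: both columns describe the same moment in different string formats.
df = pd.DataFrame(
    {
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
        "created_at": [
            "Tue Aug 01 16:23:56 +0000 2017",
            "Mon Jul 31 00:18:03 +0000 2017",
        ],
    }
)

# 'created_at' duplicates 'timestamp', so only 'timestamp' is kept.
df = df.drop(columns=["created_at"])

# Convert the remaining strings to pandas datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# The dtype listed by info() should now be a datetime type rather than object.
df.info()
```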

8. Ratings data is uncorroborated.

*Steps: The ratings were corroborated by creating a new column called 'rating', which holds each rating as a fraction string with rating_numerator as the numerator and rating_denominator as the denominator. To do this, a new column called 'slash' was created, containing a forward slash in each row. Then rating_numerator and rating_denominator were converted to string format, and the contents of the three columns were concatenated to create the 'rating' column.

*For example: if rating_numerator = 10, and rating_denominator = 10, then these were converted to strings and concatenated to create the string '10/10'.
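The string concatenation described for the 'rating' column can be sketched as follows, using two invented rows.

```python
import pandas as pd

# Toy ratings (values invented for illustration).
df = pd.DataFrame({"rating_numerator": [10, 13], "rating_denominator": [10, 10]})

# Helper column holding a forward slash in every row.
df["slash"] = "/"

# Convert both numeric columns to strings and concatenate the three columns.
df["rating"] = (
    df["rating_numerator"].astype(str)
    + df["slash"]
    + df["rating_denominator"].astype(str)
)
print(df["rating"].tolist())
```

The helper column is not strictly needed, since a literal "/" can be concatenated directly, but it is kept here to mirror the steps as described.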

9. A tidiness issue. As detailed above, the twitter files were combined into twitter_relevant_data. This was performed before any of the other steps.

10. The columns 'doggo', 'puppo', 'pupper' and 'floofer' each indicate a different dog stage in the WeRateDogs tweets. They were combined into a single column called 'dog_stage'.

*To combine them, the contents of the four columns were concatenated to form strings. E.g. "None", "None", "puppo", "None" became a single column entry "NoneNonepuppoNone". 

*Then the new dog_stage column was visually assessed using .value_counts(). This showed that it was full of values such as "NoneNoneNoneNone", "NoneNonepuppoNone", "NonepupperNoneNone", which are unclear. 

*Clearly the entry "NonepupperNoneNone" indicates that the dog_stage was actually "pupper". So for each dog_stage entry made up of three "None" values and one actual dog stage, the entry was replaced by the dog stage only. These were replaced using a conditional selection on the pandas DataFrame.

*However, there were entries such as "doggoNonepuppoNone", where users were undecided about the dog stage. These rows were dropped from the dataset because they do not indicate clearly which dog stage is present in the tweet. This could also be a misreading, with the tweet_json and twitter_archive_enhanced datasets disagreeing here. Either way, they do not belong in the cleaned dataset (unless the project's goal were to investigate the consistency of dog_stage classification, which it was not!).

*Finally, the original columns were dropped from the dataset using pd.DataFrame.drop(). 
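The dog stage steps above can be sketched as follows. The column order used for concatenation is an assumption chosen to reproduce the example strings, the rows are invented, and keeping the "NoneNoneNoneNone" rows unchanged is also an assumption, since the report does not say how they were handled.

```python
import pandas as pd

# Assumed concatenation order, chosen to match strings like "NonepupperNoneNone".
stage_cols = ["doggo", "pupper", "puppo", "floofer"]

df = pd.DataFrame(
    {
        "doggo": ["None", "None", "doggo", "None"],
        "pupper": ["None", "pupper", "None", "None"],
        "puppo": ["puppo", "None", "puppo", "None"],
        "floofer": ["None", "None", "None", "None"],
    }
)

# Concatenate the four columns into one string per row.
df["dog_stage"] = df["doggo"] + df["pupper"] + df["puppo"] + df["floofer"]
print(df["dog_stage"].value_counts())

# Replace entries with exactly one real stage by that stage alone.
for position, stage in enumerate(stage_cols):
    parts = ["None"] * len(stage_cols)
    parts[position] = stage
    df.loc[df["dog_stage"] == "".join(parts), "dog_stage"] = stage

# Drop rows naming more than one stage; rows with no stage are kept (assumption).
no_stage = "None" * len(stage_cols)
keep = df["dog_stage"].isin(stage_cols + [no_stage])
df = df[keep].drop(columns=stage_cols)
print(df)
```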

### Dropping Irrelevant Columns

Some irrelevant columns were dropped; the relevant columns are listed below. Columns were classed as irrelevant mostly because they duplicated other columns. For example, "id_str" only holds the id as a string. The "id" column, an int64 column, already holds the same information, so "id_str" was irrelevant.

Relevant columns: ['favorite_count',
'retweet_count',
'quoted_status_id',
'retweeted_status',
'doggo',
'floofer',
'puppo',
'pupper',
'retweeted_status_id',
'lang',
'full_text',
'entities',
'created_at',
'timestamp',
'rating_numerator',
'rating_denominator',
'text']

### Storing the cleaned dataframe

The dataframe was stored as a .csv file using DataFrame.to_csv().
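A minimal sketch of the storing step follows. The file name and the use of `index=False` are assumptions made for illustration; the report does not state either.

```python
import os
import tempfile

import pandas as pd

# Toy cleaned dataframe (values invented for illustration).
df = pd.DataFrame({"tweet_id": [1, 2], "rating": ["10/10", "13/10"]})

# Hypothetical output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_relevant_data.csv")

# Assumption: index=False, so the row index is not written as an extra column.
df.to_csv(out_path, index=False)

# Reading the file back is a quick check that it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Note that datetime columns come back as plain strings when a CSV is read again, unless they are parsed explicitly on load.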