## Reporting: wrangle_report
* Create a **300-600 word written report** called "wrangle_report.pdf" or "wrangle_report.html" that briefly describes your wrangling efforts. This is to be framed as an internal document.

## Data Wrangling Project - Babajide Tobiloba

The goal of this project is to practice the data wrangling skills obtained in the Data Wrangling course of the Udacity Data Analyst Nanodegree program. The data wrangled is the tweet archive of WeRateDogs (@dog_rates), a humorous Twitter account that rates people’s dogs. This data was gathered from multiple sources, cleaned, and then used for analysis.
This report documents my data wrangling journey.

### Data Gathering

The data for this project came from three different datasets, which were obtained as follows:

`twitter_archive_enhanced.csv`: This file was provided by Udacity and after downloading it, I loaded it into my workspace using the pandas function `read_csv()`.

`image_predictions.tsv`: I imported the Python `requests` and `os` libraries, downloaded the file from its URL using the `requests` library's `get()` function, and stored the result in a response variable. I wrote the response content to a TSV file called `image_predictions.tsv`, then loaded that file using the pandas function `read_csv()`.
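The download-and-load step can be sketched roughly as follows. This is a minimal illustration, not the exact code from the notebook: the helper names `download_file` and `load_tsv`, the timeout value, and the default destination path are assumptions, and the URL is left as a parameter because it is not reproduced in this report.

```python
import os

import pandas as pd


def download_file(url, dest_path="image_predictions.tsv"):
    """Fetch url with requests.get() and write the raw bytes to dest_path."""
    # Imported here so load_tsv() below still works where requests is not installed.
    import requests

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()               # stop early on an HTTP error

    # Create the target folder if the path includes one.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with open(dest_path, "wb") as f:
        f.write(response.content)
    return dest_path


def load_tsv(path):
    """Load a tab-separated file into a dataframe."""
    return pd.read_csv(path, sep="\t")
```

A call such as `load_tsv(download_file(url))` would then return the image predictions dataframe, with `url` being the address supplied for the project.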

`tweet_json.txt`: Unfortunately, my developer account was not approved by Twitter in time for the project, so I was not able to write my own code for this part of the data gathering. I tried the code provided by Udacity but encountered errors while running it, so I eventually used the JSON file provided by Udacity for the analysis. Using a `with open()` block, I read the file into a dataframe.
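Reading the JSON file might look like the sketch below. It assumes that `tweet_json.txt` stores one JSON object per line, which this report does not state explicitly, and the function name `read_tweet_json` is hypothetical.

```python
import json

import pandas as pd


def read_tweet_json(path="tweet_json.txt"):
    """Read a file of JSON objects (assumed one per line) into a dataframe."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)
```

Each key found in the JSON objects becomes a column, so fields such as `id_str` and `possibly_sensitive` end up as dataframe columns.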

### Data Assessment

After obtaining the three tables, I assessed the data using two methods:

__Visual assessment__: I printed out each of the three dataframes separately and examined them thoroughly.

__Programmatic assessment__: I inspected each dataframe using various pandas methods, including `.info()`, `.duplicated()`, `.isnull()`, `.describe()`, and `.unique()`.
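As an illustration of these checks, the sketch below runs them on a tiny made-up dataframe. The sample values and the `summarize` helper are assumptions for demonstration only; they are not taken from the project data.

```python
import pandas as pd


def summarize(df, key="tweet_id"):
    """Collect a few assessment numbers for one dataframe."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "missing_per_column": df.isnull().sum().to_dict(),
    }


# Made-up sample rows, used only to show what each method reports.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "rating_denominator": [10, 10, 50],
    "expanded_urls": ["https://example.com/a", None, None],
})

sample.info()                                  # dtypes and non-null counts
print(sample.describe())                       # summary statistics for numeric columns
print(sample["rating_denominator"].unique())   # distinct values in one column
summary = summarize(sample)
print(summary)
```

The same calls can be repeated on each of the three dataframes to surface wrong data types, duplicated IDs, and missing values.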

### Data Cleaning

Next, I cleaned the data following the Define, Code, Test format. First, I created copies of the dataframes to be cleaned using the `.copy()` function.

I then documented the issues I had observed through visual and programmatic assessment:

__Quality issues__

a. twitter_archive dataframe:

1.	there are some retweeted tweets, as shown in the `retweeted_status_id` column
2.	`timestamp` is in string format instead of datetime
3.	59 missing values in `expanded_urls` column
4.	`tweet_id` column is in integer format instead of string
5.	`rating_denominator` column should only have values of 10

b. image_predictions dataframe:

1.	incorrect dog breed name "orange" in `p1` column
2.	incorrect dog breed name "spatula" in `p3` column
3.	`tweet_id` column is in integer format instead of string

c. json_twitter_archive dataframe:

1.	143 missing values in `possibly_sensitive` and `possibly_sensitive_appealable` columns
2.	281 missing values in `extended_entities` column

__Tidiness issues__

1. `doggo`, `floofer`, `pupper`, `puppo` columns need to be combined into one column 

2. inconsistent naming of `id_str` in `json_twitter_archive`

3. certain columns are not needed for analysis and visualization

4. The three dataframes need to be combined into one table with only the relevant columns

Then, I proceeded to clean them:

- I dropped rows that have values in `retweeted_status_id` column in `twitter_archive` dataframe
- I converted `timestamp` to datetime format from string format in `twitter_archive` dataframe
- I deleted rows with missing values in `expanded_urls` column in `twitter_archive` dataframe
- I converted `tweet_id` column to string format instead of integer format in `twitter_archive` dataframe
- I dropped rows where values in `rating_denominator` column are greater than 10 in `twitter_archive` dataframe
- I dropped the row with incorrect dog breed name "orange" in `p1` column in `image_predictions` dataframe
- I dropped the row with incorrect dog breed name "spatula" in `p3` column in `image_predictions` dataframe
- I converted `tweet_id` column to string format instead of integer format in `image_predictions` dataframe
- I deleted rows with missing values in `possibly_sensitive` column in `json_twitter_archive` dataframe
- I deleted rows with missing values in `extended_entities` column in `json_twitter_archive` dataframe
- I combined `doggo`, `floofer`, `pupper`, `puppo` into one column called `stage` in `twitter_archive` dataframe
- I renamed `id_str` in `json_twitter_archive` to `tweet_id`
- I dropped unnecessary columns in the three dataframes
- I merged the three dataframes into one
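Several of the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's exact code: it assumes each stage column holds its own name when the stage applies (for example `"doggo"` in the `doggo` column), it joins multiple stages with a comma, and it uses an inner join for the merge. The function names and the sample rows are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join the stage names present in a row; None when no stage is set."""
    # Assumption: a stage column holds its own name when the stage applies.
    found = [stage for stage in STAGES if row[stage] == stage]
    return ",".join(found) if found else None


def clean_archive(archive):
    """Apply the twitter_archive cleaning steps to a copy of the dataframe."""
    keep = (
        archive["retweeted_status_id"].isnull()    # original tweets only
        & archive["expanded_urls"].notnull()       # drop rows with missing URLs
        & (archive["rating_denominator"] <= 10)    # drop denominators above 10
    )
    df = archive[keep].copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["tweet_id"] = df["tweet_id"].astype(str)
    df["stage"] = df[STAGES].apply(combine_stages, axis=1)
    return df.drop(columns=STAGES)


def merge_tables(archive, predictions, tweets):
    """Merge the three tables on tweet_id (the inner join is an assumption)."""
    predictions = predictions.assign(tweet_id=predictions["tweet_id"].astype(str))
    tweets = tweets.rename(columns={"id_str": "tweet_id"})
    return archive.merge(predictions, on="tweet_id").merge(tweets, on="tweet_id")


# Tiny made-up tables, only to exercise the functions above.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "retweeted_status_id": [None, 5.0, None, None],
    "expanded_urls": ["u1", "u2", None, "u4"],
    "rating_denominator": [10, 10, 10, 10],
    "doggo": ["doggo", "None", "None", "None"],
    "floofer": ["None"] * 4,
    "pupper": ["pupper", "None", "None", "None"],
    "puppo": ["None"] * 4,
})
predictions = pd.DataFrame({"tweet_id": [1, 4], "p1": ["pug", "beagle"]})
tweets = pd.DataFrame({"id_str": ["1", "4"], "possibly_sensitive": [False, False]})

master = merge_tables(clean_archive(archive), predictions, tweets)
print(master)
```

Converting `tweet_id` to a string in every table before merging matters here, because pandas will not match an integer key in one table against a string key in another.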


### Data Storing

I then stored the merged data as a csv file called `twitter_archive_master.csv` using the `to_csv()` function to prepare it for analysis.
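A minimal sketch of the storing step is shown below. Passing `index=False` and reading `tweet_id` back as text are assumptions about sensible defaults rather than details confirmed in this report, and both helper names are hypothetical.

```python
import pandas as pd


def store_master(df, path="twitter_archive_master.csv"):
    """Write the merged dataframe to a CSV file."""
    # index=False (an assumption) keeps the row index out of the file.
    df.to_csv(path, index=False)
    return path


def load_master(path="twitter_archive_master.csv"):
    """Reload the stored file for analysis."""
    # Read tweet_id as text so the IDs are not converted back to integers.
    return pd.read_csv(path, dtype={"tweet_id": str})
```

CSV files do not record column types, so the string conversion applied to `tweet_id` during cleaning has to be requested again when the file is read back.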

### Conclusion

This project was challenging, especially the data gathering section. I encountered challenges while trying to query Twitter’s API, which stalled me for some time, but in the end I was able to proceed and carry out the data wrangling. I hope to work on more data wrangling projects to become an expert data wrangler.
