<center>
    <h1 style="font-size: 32px;">Wrangle and Analyze Data</h1>
    <h2>Abdulrahman Mohammed Alobaidy</h2>
    <h3>Cohort 9</h3>
    <h3>Data Analyst Nanodegree Program</h3>
    <h4>Email: <a href="mailto:AbdulrahmanAlobaidy2001@gmail.com">AbdulrahmanAlobaidy2001@gmail.com</a></h4>
</center>
<hr>

# Introduction
---

In this project, we work with tweets and data from [**WeRateDogs**](https://twitter.com/dog_rates) (**@dog_rates**), which we must wrangle before the data is suitable for analysis.

[**WeRateDogs**](https://twitter.com/dog_rates) is a Twitter account that posts photos of pet dogs, each with a rating and sometimes with the dog's name and dog stage. Here, we **gather** data related to the account from multiple sources, **assess** its quality and tidiness, and **clean** it by fixing the problems found during the assessment stage. Together, these steps make up **data wrangling**, which prepares the data for the ultimate goal: **data analysis**.

<img src="https://github.com/AbdulrahmanAlobaidy/DAND-project-4/blob/main/tweet.png?raw=true" alt="WeRateDogs Tweet" width="50%">

# Data Gathering
---

We gather data from three sources: a local text file, a file requested from the internet, and data from the Twitter API. For the last one, I couldn't get access to the Twitter API, so I opted to read the data from a text file instead.

### 1. `twitter_archive_enhanced.csv`
First, we read a `csv` file named `twitter_archive_enhanced.csv`, which contains information about the tweets, into a Pandas DataFrame named `df_enhanced`.
### 2. `image_predictions.tsv`
Then, we download a `tsv` file named `image_predictions.tsv` using Python's Requests library, store it in a Pandas DataFrame named `df_image_predictions`, and save it locally.
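
A minimal sketch of this step, assuming the file is served as plain tab-separated text. The helper names `parse_tsv` and `download_tsv`, the `url` argument, and the sample columns are hypothetical and only for illustration. `requests` is imported inside the function so the parsing part can be tried offline.

```python
import io
import pandas as pd

def parse_tsv(raw_bytes):
    """Parse tab-separated bytes into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw_bytes), sep="\t")

def download_tsv(url, local_path="image_predictions.tsv"):
    """Download a TSV file, save it locally, and return it as a DataFrame."""
    import requests  # third-party; imported here so parse_tsv works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(local_path, "wb") as f:
        f.write(response.content)
    return parse_tsv(response.content)

# Offline check with one made-up row (column names are illustrative)
sample = b"tweet_id\tp1\tp1_conf\n1\tgolden_retriever\t0.95\n"
df_image_predictions = parse_tsv(sample)
```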
### 3. `tweet-json.txt`
Finally, we read the API data, which provides more information about the tweets from the first step. We read the `tweet-json.txt` file line by line, where each line is a `json` string, then parse every line and store it in a list that is later imported into a Pandas DataFrame named `df_api`. Lastly, this DataFrame is exported as a file named `tweet_json.csv`.
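
The line-by-line parsing could look roughly like the sketch below. The helper name `read_json_lines` and the fields in the sample lines are assumptions made for illustration. An in-memory buffer stands in for the real `tweet-json.txt` file.

```python
import io
import json
import pandas as pd

def read_json_lines(lines):
    """Parse an iterable of JSON strings (one tweet per line) into a DataFrame."""
    records = [json.loads(line) for line in lines if line.strip()]
    return pd.DataFrame(records)

# Stand-in for open("tweet-json.txt"); the fields shown are illustrative
fake_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
df_api = read_json_lines(fake_file)
# df_api.to_csv("tweet_json.csv", index=False)  # export step described above
```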

# Data Assessment
---

## Visual Assessment

Now it's time to first assess the datasets visually.

---
### 1. `df_enhanced`
We start by displaying the first five rows from the first DataFrame, `df_enhanced`.

The first thing we notice is that the `rating_denominator` column is redundant because its value is constant: it is always **10**.


The second thing we notice is that the dog stages are distributed into four columns, (`doggo`, `floofer`, `pupper`, `puppo`), and need to be combined into a single column.

Finally, there is no rating column that holds the result of dividing the numerator by the denominator, which we need for analysis purposes.

### 2. `df_image_predictions`
Now we display the first five rows from the `df_image_predictions` DataFrame.

This one seemed okay.

### 3. `df_api`
At last, we also display the first five rows from the `df_api` DataFrame.

This one also seemed fine, but many of its columns were collapsed in the display, so we need to examine it further programmatically.

## Programmatic Assessment


Now it's time to assess the datasets programmatically.

---

### 1. `df_enhanced`
We use Pandas' `info` method on `df_enhanced` to look at the columns, their non-null counts and their datatypes.

We noticed the following:

* The `timestamp` column is of datatype `object` instead of `datetime`.

* The retweets need to be removed from the dataset in order to analyze only the **WeRateDogs** tweets, thus we don't need the `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` columns.

One of the **Key Points** in the **Project Motivation** section mentioned that there could be a problem with the `rating_numerator` and `rating_denominator` columns, so we used Pandas' `value_counts` method to look at the counts of each value in the `rating_numerator` column only, since the `rating_denominator` column was to be removed anyway.

We did find some suspicious values, such as 420, 144 and 960.
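
A small sketch of this check, using made-up numerators in place of the real column. Sorting the counts by value is one possible way to surface unusually large numerators; it is not necessarily how the notebook did it.

```python
import pandas as pd

# Made-up numerators standing in for df_enhanced["rating_numerator"]
df_enhanced = pd.DataFrame({"rating_numerator": [12, 13, 12, 420, 11, 12, 960]})

# How often each numerator occurs
counts = df_enhanced["rating_numerator"].value_counts()

# Sorting by the numerator itself puts unusually large values at the end
suspicious = counts.sort_index().tail(2)
```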

### 2. `df_image_predictions`
We also used the `info` method on `df_image_predictions`. Fortunately, everything seemed fine.

### 3. `df_api`
Again, we first used the `info` method, and we noticed the following:

* The `created_at` column is of datatype `object` instead of `datetime`.
* Both the `possibly_sensitive` and the `possibly_sensitive_appealable` columns are of datatype `object`, whereas they should be `bool`.

Then we ran the `value_counts` method on the `lang` column and decided that its datatype should be changed from `object` to `category`.

# Assessment Results
---

### Tidiness Issues:
1. The dog stage columns (`doggo`, `floofer`, `pupper`, `puppo`) need to be melted into a single column.
2. Merge all of the three DataFrames into a single master dataset.

### Quality Issues:
1. The `rating_denominator` column in `df_enhanced` is redundant.
2. Remove retweets from `df_enhanced`.
3. Remove the `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` columns from `df_enhanced`.
4. `timestamp` column of `df_enhanced` should be of datatype `datetime`.
5. Extract the `rating_numerator` from `text` in `df_enhanced`.
6. Convert `rating_numerator`'s datatype to `float`.
7. Remove records from `df_enhanced` with wrong `rating_numerator` values.
8. Missing `rating` column.
9. `created_at` column in `df_api` should be of datatype `datetime`.
10. `possibly_sensitive` column in `df_api` should be of datatype `bool`.
11. `possibly_sensitive_appealable` column in `df_api` should be of datatype `bool`.
12. `lang` column in `df_api` should be of datatype `category`.

# Data Cleaning
---

First of all, we create copies of the three DataFrames to preserve the originals in case we accidentally corrupt one or more of them. We name the copies `df_enhanced_copy`, `df_image_predictions_copy` and `df_api_copy`, respectively.
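
As a brief illustration of why a real copy matters, here is a sketch on a made-up two-row DataFrame. `DataFrame.copy()` makes a deep copy by default, so edits to the copy leave the original untouched.

```python
import pandas as pd

# Made-up stand-in for the original DataFrame
df_enhanced = pd.DataFrame({"rating_numerator": [12, 13]})

# copy() is deep by default, so the two frames do not share data
df_enhanced_copy = df_enhanced.copy()

# Changing the copy does not affect the original
df_enhanced_copy.loc[0, "rating_numerator"] = 0
```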


Now we address the aforementioned issues.

---

#### 1. The dog stage columns (`doggo`, `floofer`, `pupper`, `puppo`) need to be melted into a single column.

First, we replace the **None**s with **NaN**s, then combine the four columns by joining them with a comma. Since some records have no dog stage, we replace the resulting empty strings with **NaN**s. We store the combined data in a new column called `dog_stage` and drop the four original columns.
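
One possible way to sketch this step is shown below, on a made-up four-row DataFrame. It assumes missing stages are stored as the string `"None"`. It uses `mask` to blank out those strings, which is a substitute for a plain `replace` with the same effect.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]
df = pd.DataFrame({
    "doggo":   ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "pupper", "pupper", "None"],
    "puppo":   ["None", "None", "None", "None"],
})

# Turn the "None" strings into NaN
stages = df[stage_cols].mask(df[stage_cols] == "None")

# Join the non-missing stages of each row with a comma
df["dog_stage"] = stages.apply(lambda row: ",".join(row.dropna()), axis=1)

# Rows with no stage produced an empty string; mark them as missing
df["dog_stage"] = df["dog_stage"].replace("", np.nan)

# The four original columns are no longer needed
df = df.drop(columns=stage_cols)
```

A row that carries two stages ends up with a combined value such as `doggo,pupper`.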


#### 2. The `rating_denominator` column in `df_enhanced` is redundant.

Here, we just drop the `rating_denominator` column because it is always **10**.


#### 3. Remove retweets from `df_enhanced`.

We remove the retweets from `df_enhanced_copy` by keeping only the records with nulls in the `retweeted_status_id` column, since a non-null value there marks a retweet.
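
A sketch of filtering out retweets on made-up rows. It assumes that a retweet is a row with a non-null `retweeted_status_id`, which is an assumption about the archive's layout.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second one is a retweet
df_enhanced_copy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.9e17, np.nan],
})

# Keep only the rows that are not retweets
is_original = df_enhanced_copy["retweeted_status_id"].isna()
df_enhanced_copy = df_enhanced_copy[is_original]
```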


#### 4. Remove the `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` columns from `df_enhanced`.

We remove the aforementioned columns from `df_enhanced_copy`.


#### 5. `timestamp` column of `df_enhanced` should be of datatype `datetime`.

We change the `timestamp` column's datatype to `datetime` using Pandas' `to_datetime` function.
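
A minimal sketch of the conversion. The timestamp strings are made up and their format is an assumption. Passing `utc=True` is one option that yields a timezone-aware column regardless of how the offset is written.

```python
import pandas as pd

# Made-up timestamps stored as plain strings (object dtype)
df = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"]
})

# Parse the strings into timezone-aware datetimes
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
```

Once converted, the `.dt` accessor exposes parts such as `df["timestamp"].dt.year`.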


#### 6. Extract the `rating_numerator` from `text` in `df_enhanced`.

We extract the numerator from the `text` column in `df_enhanced_copy`, then store it in `rating_numerator`.


#### 7. Convert `rating_numerator`'s datatype to `float`.

We convert the `rating_numerator`'s datatype to `float`.
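
Steps 6 and 7 can be sketched together as follows. The regular expression is an assumption, not necessarily the pattern used in the notebook. It captures an integer or decimal number directly before `/10`. The sample tweets are made up.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "This is Bella. She is a good girl. 13.5/10",
    "Meet Sam. 12/10 would pet",
    "No rating here",
]})

# Capture an integer or decimal number that comes right before "/10"
pattern = r"(\d+(?:\.\d+)?)/10"
df["rating_numerator"] = df["text"].str.extract(pattern, expand=False)

# The extracted values are strings; convert them to float (no match stays NaN)
df["rating_numerator"] = df["rating_numerator"].astype(float)
```

Extracting from the text rather than reusing the existing column keeps decimal ratings such as 13.5 intact, which is also why `float` is used here.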

#### 8. Remove records from `df_enhanced` with wrong `rating_numerator` values.

We get the indices where the `rating_numerator` is 1776 or 420, then drop them from `df_enhanced_copy`.
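
A sketch of this removal on made-up values, using `isin` to locate the rows and `drop` to remove them by index:

```python
import pandas as pd

# Made-up numerators; 1776 and 420 are the values named above
df_enhanced_copy = pd.DataFrame({"rating_numerator": [12.0, 1776.0, 13.0, 420.0]})

# Index labels of the rows with the unwanted numerators
is_bad = df_enhanced_copy["rating_numerator"].isin([1776, 420])
bad_index = df_enhanced_copy[is_bad].index

# Drop those rows by index label
df_enhanced_copy = df_enhanced_copy.drop(index=bad_index)
```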


#### 9. Missing `rating` column.

We divide the `rating_numerator` by **10** and store the result in a new `rating` column.


#### 10. `created_at` column in `df_api` should be of datatype `datetime`.

Change the `created_at` column in `df_api_copy` to `datetime` using Pandas' `to_datetime` function.

#### 11. `possibly_sensitive` column in `df_api` should be of datatype `bool`.

Change the datatype of the `possibly_sensitive` column in `df_api_copy` to `bool`.

#### 12. `possibly_sensitive_appealable` column in `df_api` should be of datatype `bool`.

Change the datatype of the `possibly_sensitive_appealable` column in `df_api_copy` to `bool`.
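
One caveat worth sketching: if these columns contain missing values (an assumption, since the report does not say), a plain `astype(bool)` turns every `NaN` into `True`. The sketch below uses pandas' nullable `"boolean"` dtype as an alternative that keeps missing values missing. The data is made up.

```python
import numpy as np
import pandas as pd

# Made-up values; object dtype because booleans are mixed with NaN
df_api_copy = pd.DataFrame({
    "possibly_sensitive": [False, np.nan, True],
    "possibly_sensitive_appealable": [False, np.nan, False],
}, dtype=object)

# What a plain bool cast would do: NaN becomes True
naive = df_api_copy["possibly_sensitive"].astype(bool)

# The nullable boolean dtype keeps the missing value as <NA>
for col in ["possibly_sensitive", "possibly_sensitive_appealable"]:
    df_api_copy[col] = df_api_copy[col].astype("boolean")
```

If the columns have no missing values, `astype(bool)` and `astype("boolean")` give the same truth values.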

#### 13. `lang` column in `df_api` should be of datatype `category`.

Change the datatype of the `lang` column in `df_api_copy` to `category`.
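
A short sketch of this conversion on made-up language codes. A column with few distinct, frequently repeated values is a typical candidate for `category`.

```python
import pandas as pd

# Made-up language codes
df_api_copy = pd.DataFrame({"lang": ["en", "en", "nl", "en", "und"]})

# Few distinct values with many repeats
lang_counts = df_api_copy["lang"].value_counts()

# Store the column as a categorical
df_api_copy["lang"] = df_api_copy["lang"].astype("category")
```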

#### 14. Merge all of the three DataFrames into a single master dataset.

We keep only the needed columns from `df_api_copy`, then use Pandas' `merge` method to combine the three cleaned DataFrames into a single master dataset, and save this dataset to a file called `twitter_archive_master.csv`.
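
The merge could be sketched as below, on tiny made-up DataFrames. The key column names (`tweet_id`, `id`), the columns kept from the API data, and the inner join are all assumptions for illustration, since the report does not spell them out.

```python
import pandas as pd

# Tiny made-up stand-ins for the three cleaned DataFrames
df_enhanced_copy = pd.DataFrame({"tweet_id": [1, 2, 3], "rating": [1.2, 1.3, 1.1]})
df_image_predictions_copy = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "chow"]})
df_api_copy = pd.DataFrame({
    "id": [1, 2, 3],
    "retweet_count": [10, 3, 8],
    "favorite_count": [50, 7, 20],
    "source": ["a", "b", "c"],  # example of a column that is not kept
})

# Keep only the needed columns and align the key name with the other frames
api_subset = df_api_copy[["id", "retweet_count", "favorite_count"]].rename(
    columns={"id": "tweet_id"}
)

# Inner joins keep only tweets present in all three sources
master = (
    df_enhanced_copy
    .merge(df_image_predictions_copy, on="tweet_id", how="inner")
    .merge(api_subset, on="tweet_id", how="inner")
)
# master.to_csv("twitter_archive_master.csv", index=False)  # save step from the text
```

With an inner join, the third tweet is dropped because it has no image prediction. A left join would keep it with missing prediction values instead.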

# Resources
---

All of the code and datasets for this project are included in a [GitHub repository](https://github.com/AbdulrahmanAlobaidy/DAND-project-4) dedicated to this project.