<h1 style="color: orange">Wrangle report</h1>

This report documents my efforts to wrangle and clean the dataset from the "WeRateDogs" Twitter account. It starts with a short introduction to the dataset, then describes the gathering effort, and finally covers the assessing and cleaning done in the project using visual and programmatic assessment.

<h2 style="color: green">Introduction</h2>

This project is about a dataset called `WeRateDogs`. As the project requirements state, I needed to perform a complete data analysis process, starting with data gathering and wrangling and ending with EDA.

<h3 style="color: green">Goals</h3>
* Perform a full data analysis process.
* Gather data from different resources.
* Explore real data.
* Use Programmatic and Visual Assessment to assess data.
* Explore the popularity of this Twitter account more deeply.

<h3 style="color: green">Questions</h3>

Before starting the analysis, we need to ask a few questions:
* What is the overall distribution of the tweet engagement metrics (favorites and retweets)?
* How are the different dog stages `dog_stage` represented in the dataset?
* What are the most frequently mentioned dog breeds and how do they correlate with tweet engagement?

<h3 style="color: green">Tools Used:</h3>

* Python
* Jupyter Notebooks
* NumPy
* Pandas
* Seaborn
* Matplotlib

<h3 style="color: green">Phases of the Analysis:</h3>

* Gathering
* Assessing
* Cleaning
* Storing
* Analysis & Visualization

<h2 style="color: green">Data gathering 🔥</h2>

I was provided with three datasets related to this Twitter account, coming from different sources and in different formats.

* `twitter-archive-enhanced.csv`: A Twitter archive containing the main information about the tweets (timestamp, id, retweets, etc.).
* `image-predictions.tsv`: Downloaded programmatically from a link and read with the pandas function `read_csv()`. It contains the results of an AI model's image predictions, each with a confidence percentage.
* `tweet-json.txt`: Tweet information extracted through the Twitter API.
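
The reading step for the last two files can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code: it assumes `tweet-json.txt` holds one JSON object per line with the fields `id`, `retweet_count` and `favorite_count`, and it uses tiny in-memory stand-ins with made-up values instead of the real files. The helper `read_tweet_json` is a hypothetical name.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Parse one JSON object per line into a dataframe of engagement counts."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# In-memory stand-ins for the real files (made-up values, assumed layout).
sample_tsv = io.StringIO(
    "tweet_id\tp1\tp1_conf\tp1_dog\n"
    "1\tgolden_retriever\t0.95\tTrue\n"
)
sample_json = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)

# A .tsv file is tab-separated, so read_csv needs sep="\t".
predictions = pd.read_csv(sample_tsv, sep="\t")
tweet_counts = read_tweet_json(sample_json)
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles or file paths.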

<h2 style="color: green">Assessing 👀</h2>

### Quick Observation
We have a dataframe that holds a lot of data about different tweets. Each tweet is recorded with a timestamp and an id, which means no duplicates are expected.

**We have many functional columns like:**

* Dog Breeds Columns
* Dog names
* Rating Columns
* Tweet's text
* Tweet's Timestamp

Our main data is `categorical`, which means we're not expecting to perform much numerical analysis at this point.

### Assessment Report:
Visual and programmatic assessment produced the following results:

**Quality**

Completeness
- Many missing values in `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_user_id`, `retweeted_status_timestamp`, `doggo`, `floofer`, `puppo`, `pupper`, `name`, `expanded_urls`
- `name` values are extracted from the `text` column, so `name` is filled only when a name appears in the tweet; likewise, the dog breed column is filled only if the breed is mentioned in the tweet
- The `contributors`, `coordinates`, and `geo` columns have missing data

Uniqueness
- <i>No duplicates found</i>

Validity
- `timestamp` and `created_at` should be `datetime64[ns]`
- `id` columns should be `object` (string), not numeric
- Some extracted `name` values do not match the name in the tweet
- The `p2*` and `p3*` confidences are not very high when `p1` is a dog
- Some columns contain JSON data (e.g. `entities`)

Accuracy
- Some `name` values have fewer than 3 characters
- The `rating` numerator doesn't seem to be accurate
- `img_num` has a value of `2` for tweets with a single image
- The `p1*` predictions aren't very accurate

Inconsistency
- `name` values follow inconsistent patterns
- Values in the `p1 ... p3` columns follow different naming patterns, with underscores between words

**Tidiness Issues**
- The `text` column also contains the information stored in `rating` and `expanded_urls`
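
The programmatic checks behind a report like the one above can be sketched as follows. This is an illustrative sketch on a made-up sample dataframe, not the project data; the column names follow the ones mentioned in this report.

```python
import pandas as pd

# Made-up sample with the same kinds of problems described above.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "name": ["Phineas", "a", None, "O"],
})

# Completeness: number of missing values per column.
missing_per_column = df.isnull().sum()

# Uniqueness: number of fully duplicated rows.
duplicate_rows = df.duplicated().sum()

# Validity: inspect the dtypes (timestamp is still text here).
dtypes = df.dtypes

# Accuracy: names shorter than 3 characters are suspicious.
short_names = df.loc[df["name"].str.len() < 3, "name"]
```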

<h2 style="color: green">Cleaning 🧹</h2>

We're required to assess and clean at least **8 quality issues** and at least **2 tidiness issues** in this dataset.

<h3 style="color: yellow">Quality Issue: Change Indexes</h3>
I noticed that the column identifying each tweet has a different name in each dataset, which would cause merge issues. I solved this by renaming each identifier column to the same name, removing the `index` column, and setting `tweet_id` as the index.
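
A minimal sketch of this step, assuming one dataset names its identifier `id` and carries a leftover `index` column (both are assumptions for illustration, shown on made-up data):

```python
import pandas as pd

archive = pd.DataFrame({"tweet_id": [1, 2], "text": ["a", "b"]})
tweet_counts = pd.DataFrame({"index": [0, 1], "id": [1, 2], "retweet_count": [10, 3]})

tweet_counts = (
    tweet_counts
    .drop(columns="index")               # remove the leftover index column
    .rename(columns={"id": "tweet_id"})  # use the same identifier name everywhere
    .set_index("tweet_id")               # make tweet_id the index
)
archive = archive.set_index("tweet_id")
```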

<h3 style="color: purple">Tidiness Issue: Merge the three different datasets into one big dataframe</h3>
I started with tidiness issues such as merging, because working on a single dataframe makes the rest of the cleaning easier. Merging the three datasets into one was therefore my first priority.
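
One way to do the merge, sketched on made-up data and assuming all three dataframes are already indexed by `tweet_id`. The left join is my assumption here; an inner join would instead keep only tweets present in all three datasets.

```python
import pandas as pd

ids = pd.Index([1, 2, 3], name="tweet_id")
archive = pd.DataFrame({"text": ["a", "b", "c"]}, index=ids)
predictions = pd.DataFrame({"p1": ["pug", "chow"]}, index=pd.Index([1, 2], name="tweet_id"))
tweet_counts = pd.DataFrame({"retweet_count": [10, 3, 5]}, index=ids)

# Join on the shared tweet_id index; tweets without a prediction get NaN in p1.
merged = archive.join(predictions, how="left").join(tweet_counts, how="left")
```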

<h3 style="color: yellow">Quality Issue: Remove empty and unneeded columns</h3>
The project's description pointed to some unnecessary columns, so I removed them right after merging to keep the data readable and the analysis useful.
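
A small sketch of how empty and unneeded columns can be dropped. The data is made up, and the `unneeded` list is a hypothetical choice for illustration:

```python
import pandas as pd

merged = pd.DataFrame({
    "text": ["a", "b"],
    "contributors": [None, None],          # entirely empty
    "in_reply_to_status_id": [None, 5.0],  # mostly empty, not needed here
})

unneeded = ["in_reply_to_status_id"]  # hypothetical selection

cleaned = (
    merged
    .dropna(axis="columns", how="all")  # drop columns with no values at all
    .drop(columns=unneeded)             # drop columns we decided not to keep
)
```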

<h3 style="color: yellow">Quality Issue: Correct data types</h3>
One of the most common issues is wrong data types for certain columns, which often happens when data is entered or exported through spreadsheet software. Some columns here had the wrong type, such as `timestamp`.
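
The type fixes can be sketched like this on made-up values. The timestamp format with a `+0000` offset is an assumption, and the exact resulting datetime dtype depends on the pandas version and on whether timezone information is present.

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

df["timestamp"] = pd.to_datetime(df["timestamp"])  # text -> datetime
df["tweet_id"] = df["tweet_id"].astype(str)        # ids are labels, not numbers
```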

<h3 style="color: yellow">Quality Issue: Make sure all tweets have images and ratings</h3>
This checks whether each tweet has a rating and an image, to keep null values to a minimum.
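
A sketch of this check, assuming the image URL lives in a column called `jpg_url` and the rating in `rating_numerator` and `rating_denominator` (these column names are assumptions, and the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "jpg_url": ["https://example.com/a.jpg", None, "https://example.com/c.jpg"],
    "rating_numerator": [13, 12, None],
    "rating_denominator": [10, 10, 10],
})

required = ["jpg_url", "rating_numerator", "rating_denominator"]

# Keep only the tweets that have both an image and a complete rating.
with_image_and_rating = df.dropna(subset=required)
```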

<h3 style="color: yellow">Quality Issue: Check Doggie stage columns spread</h3>
There are **4** columns, one for each dog stage. When no stage applies (for example, the dog isn't in the image), the `dog_stage` column is `null`.
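
One possible way to build `dog_stage` from the four columns is sketched below. It assumes a missing stage is stored as the text `"None"` (or as a real null), and `combine_stages` is a hypothetical helper name:

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

df = pd.DataFrame({
    "doggo":   ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "pupper", "None"],
    "puppo":   ["None", "None", "None"],
})


def combine_stages(row):
    """Return the stage(s) found in a row, or None when there is no stage."""
    found = [row[col] for col in stage_cols
             if pd.notna(row[col]) and row[col] != "None"]
    return ", ".join(found) if found else None


df["dog_stage"] = df[stage_cols].apply(combine_stages, axis=1)
```

Joining with a comma keeps tweets that mention more than one stage visible instead of silently picking one.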

<h3 style="color: yellow">Quality Issue: Make a new rating column</h3>
Since we can't change the numerator and denominator themselves, we can express the rating in another form: a single decimal value.
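
A sketch of the decimal rating, assuming the columns are named `rating_numerator` and `rating_denominator` and that the new column is called `rating` (all three names are assumptions; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "rating_numerator": [13, 12, 84],
    "rating_denominator": [10, 10, 70],
})

# Dividing puts ratings with different denominators on the same scale.
df["rating"] = df["rating_numerator"] / df["rating_denominator"]
```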

<h3 style="color: yellow">Quality Issue: Image prediction column needs casing</h3>
I see that the image prediction strings are mixed-case (some begin with capital letters, others with lowercase). I'm going to fix this by converting them all to lowercase.
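
A minimal sketch of the lowercase conversion on made-up prediction values:

```python
import pandas as pd

df = pd.DataFrame({
    "p1": ["Golden_retriever", "pug"],
    "p2": ["Labrador_retriever", "Chow"],
    "p3": ["kuvasz", "Pomeranian"],
})

# Normalise the casing of every prediction column.
for col in ["p1", "p2", "p3"]:
    df[col] = df[col].str.lower()
```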

<h3 style="color: yellow">Quality Issue: Image predictions are spread across 3 columns and should be in a single column</h3>
Let's take a look at the image predictions. Here's how the columns are described:

- p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
- p1_conf is how confident the algorithm is in its #1 prediction → 95%
- p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
- p2 is the algorithm's second most likely prediction → Labrador retriever
- p2_conf is how confident the algorithm is in its #2 prediction → 1%
- p2_dog is whether or not the #2 prediction is a breed of dog → TRUE

We'll probably need to pick a confidence level, so let's take a look at the ranges.
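
The idea can be sketched as follows: look at the confidence ranges, then keep the highest-ranked prediction that is a dog and passes a threshold. The threshold `MIN_CONF`, the helper `best_dog_prediction` and the output column `breed` are hypothetical, and the sample rows are made up; only two prediction ranks are shown to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    "p1": ["golden_retriever", "paper_towel"],
    "p1_conf": [0.95, 0.60],
    "p1_dog": [True, False],
    "p2": ["labrador_retriever", "pug"],
    "p2_conf": [0.01, 0.30],
    "p2_dog": [True, True],
})

# Summary statistics help when choosing a confidence level.
conf_ranges = df[["p1_conf", "p2_conf"]].describe()

MIN_CONF = 0.2  # hypothetical threshold


def best_dog_prediction(row):
    """Return the first prediction that is a dog and confident enough."""
    for rank in (1, 2):
        if row[f"p{rank}_dog"] and row[f"p{rank}_conf"] >= MIN_CONF:
            return row[f"p{rank}"]
    return None


df["breed"] = df.apply(best_dog_prediction, axis=1)
```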

<h3 style="color: yellow">Quality Issue: Missing names</h3>
I noticed that a huge number of names are missing. I couldn't drop those rows because they make up a large share of the data. However, I found that I could take the most repeated `name` value and replace it with null. There were also wrong names like `a`, `such`, `an`, and `O`.
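
A sketch of this cleanup on a made-up sample. It assumes the most frequent value is a placeholder for a missing name (the text `"None"` in this sample):

```python
import pandas as pd

df = pd.DataFrame({"name": ["None", "Charlie", "a", "None", "such", "Lucy", "O"]})

# The most repeated value is assumed to be the missing-name placeholder.
most_common = df["name"].value_counts().idxmax()
wrong_names = ["a", "such", "an", "O"]

# Replace the placeholder and the wrong names with null (NaN).
to_null = [most_common] + wrong_names
df["name"] = df["name"].mask(df["name"].isin(to_null))
```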

<h3 style="color: purple">Tidiness Issue: Make a new "Social Total" Column</h3>
I'm going to be looking at the patterns of retweets and favorites for individual tweets, so I'll create a derived column called `social_total` that equals the sum of `favorite_count` and `retweet_count`. That way I don't have to repeat this arithmetic.
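
A minimal sketch of the derived column on made-up counts:

```python
import pandas as pd

df = pd.DataFrame({"favorite_count": [25, 7], "retweet_count": [10, 3]})

# Combined engagement per tweet.
df["social_total"] = df["favorite_count"] + df["retweet_count"]
```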

<h2 style="color: green">Storing 📦</h2>

After merging the data into one big dataframe and performing the cleaning process, I saved the new dataframe to a new `csv` file.
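
The storing step can be sketched as below. The file name `twitter_archive_master.csv` and the temporary directory are assumptions for illustration, and the dataframe is a made-up stand-in for the cleaned data.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"rating": [1.3, 1.2]},
    index=pd.Index(["1001", "1002"], name="tweet_id"),
)

# Hypothetical output location.
out_path = os.path.join(tempfile.gettempdir(), "twitter_archive_master.csv")

df.to_csv(out_path)               # the tweet_id index is written as the first column
restored = pd.read_csv(out_path)  # quick check that the file reads back
```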