### Extracting Data
Extracting data is almost always the first step when building a data pipeline. Data can be extracted from sources of many shapes and sizes. Here are just a few:

- APIs
- SFTP sites
- Relational databases
- NoSQL databases (columnar, document, key-value)
- Flat files

In this code-along, we'll focus on extracting data from flat files. A flat file might be something like a .csv or a .json file. The two files that we'll be extracting data from are apps_data.csv and review_data.csv. To do this, we'll use pandas. Let's take a closer look!
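
As a minimal sketch of what reading a flat file involves, the snippet below parses a tiny CSV and an equivalent JSON document from in-memory strings. The sample records are made up for illustration and do not come from apps_data.csv or review_data.csv.

```python
import io

import pandas as pd

# Hypothetical sample records, used only for illustration
csv_text = "App,Rating\nExample App A,4.5\nExample App B,3.9\n"
json_text = '[{"App": "Example App A", "Rating": 4.5}, {"App": "Example App B", "Rating": 3.9}]'

# read_csv and read_json both accept a file path or a file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# Both calls should produce a DataFrame with the same columns and shape
print(csv_df.shape, list(csv_df.columns))
print(json_df.shape, list(json_df.columns))
```

With real files, the `io.StringIO(...)` argument would simply be replaced by the file path.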

* After importing pandas, read the apps_data.csv file into a DataFrame. Print the head of the DataFrame.
* Similar to before, read the review_data.csv file into a DataFrame. Take a look at the first few rows of this DataFrame.
* Print the column names, shape, and data types of the apps DataFrame.




In [3]:
# let's import the pandas package
import pandas as pd

# let's load the apps data
app_data = pd.read_csv("data/apps_data.csv")
app_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [4]:
# let's load the review data
review_data = pd.read_csv("data/review_data.csv")

review_data.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [5]:
# let's print the column names, shape, and data types of the apps DataFrame.

app_data.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

In [6]:
app_data.shape

(10841, 13)

In [7]:
app_data.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
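
The dtypes output above reports most columns, including Reviews and Installs, as object, which means pandas read them as strings. A rough sketch of how such columns could be converted to numbers is shown below. It uses a small hand-built frame whose values are copied from the first rows displayed earlier, and the cleaning rules (dropping commas and the trailing plus sign) are an assumption about the format rather than a guarantee for every row in the file.

```python
import pandas as pd

# Small hand-built frame mimicking the string formats seen in the apps data
sample = pd.DataFrame({
    "Reviews": ["159", "967", "87510"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
sample["Reviews"] = pd.to_numeric(sample["Reviews"], errors="coerce")

# Strip the thousands separators and the trailing plus sign before converting
sample["Installs"] = (
    sample["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

print(sample.dtypes)
print(sample)
```

Once converted, these columns can be compared and sorted numerically rather than alphabetically.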

The code above works perfectly well, but this time let's try using DRY-principles to build a function to extract data.

- Create a function called extract, with a single parameter named file_path.
- Print the number of rows and columns in the DataFrame, as well as the data type of each column. Provide instructions about how to use the value that will eventually be returned by this function.
- Return the variable data.
- Call the extract function twice, once passing in the apps_data.csv file path, and another time with the review_data.csv file path. Output the first few rows of the apps_data DataFrame.

In [8]:
# let's create the function to automate loading the data
def extract(file_path):
    # load the data
    data = pd.read_csv(file_path)
    
    # let's print the column names, shape and data types of the columns in the dataframe
    print('================ Detailed Information about the dataframe ===================')
    print(f'There are {data.shape[0]} rows and {data.shape[1]} columns in the dataframe')
    print('Below are the columns and their respective data types')
    print(data.dtypes)
    
    print("\nNow, let's view sample data from the dataframe")
    
    return data


# let's test the function
app_df = extract("data/apps_data.csv")

# display the DataFrame returned by extract
app_df

There are 10841 rows and 13 columns in the dataframe
Below are the columns and their respective data types
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

Now, let's view sample data from the dataframe


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
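
Since flat files may also be .json files, one possible extension of extract is to pick the pandas reader from the file extension. The sketch below is only an illustration: the helper name `extract_flat_file` and the `READERS` mapping are hypothetical and not part of the code-along.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader
READERS = {".csv": pd.read_csv, ".json": pd.read_json}


def extract_flat_file(file_path):
    """Load a .csv or .json flat file into a DataFrame (illustrative helper)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in READERS:
        # Fail early with a clear message instead of a confusing parser error
        raise ValueError(f"Unsupported file type: {suffix!r}")

    data = READERS[suffix](file_path)
    print(f"Loaded {data.shape[0]} rows and {data.shape[1]} columns from {file_path}")
    return data
```

Under these assumptions, `extract_flat_file("data/apps_data.csv")` would behave much like extract, while a path with an unsupported extension raises a `ValueError`.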


### Transforming Data
We're interested in working with the apps and their corresponding reviews in the "FOOD_AND_DRINK" category. We'd like to do the following:

- Define a function named transform. This function will have five parameters: apps, reviews, category, min_rating, and min_reviews.
- Drop duplicates from both DataFrames.
- For each of the apps in the desired category, find the average sentiment polarity of its reviews, keeping only the App and Sentiment_Polarity columns.
- Join this back to the apps dataset, only keeping the following columns:
    - App
    - Rating
    - Reviews
    - Installs
    - Sentiment_Polarity
- Filter out all records that fall short of the min_rating or min_reviews thresholds.
- Order by the rating and number of reviews, both in descending order.
- Call the function for the "FOOD_AND_DRINK" category, with a minimum average rating of 4 stars, and at least 1000 reviews.

Alright, let's give it a shot!

In [19]:
pd.options.mode.copy_on_write = True

In [20]:
# let's define the transform function
def transform(apps, reviews, category, min_rating, min_reviews):
    print(f"Let's transform the data to curate a dataset with all {category} apps and their corresponding reviews with a rating of at least {min_rating} and {min_reviews} reviews")

    # let's drop the duplicates
    apps = apps.drop_duplicates(['App'])
    reviews = reviews.drop_duplicates()

    # keep the apps in the requested category, and the reviews that belong to those apps
    food_drinks_apps = apps.loc[apps['Category'] == category, :]
    food_drinks_reviews = reviews.loc[reviews['App'].isin(food_drinks_apps['App']), ['App', 'Sentiment_Polarity']]

    # calculate the average sentiment polarity per app
    avg_reviews = food_drinks_reviews.groupby(by='App').mean()

    # combining the apps and reviews dataframe
    combined_dataset = food_drinks_apps.join(avg_reviews, on="App", how="left")

    # getting only the needed columns
    filtered_apps_reviews = combined_dataset[['App', 'Rating', 'Reviews', 'Installs', 'Sentiment_Polarity']]

    # Reviews is read as strings (object dtype), so convert it to integers before comparing
    filtered_apps_reviews['Reviews'] = filtered_apps_reviews['Reviews'].astype(int)

    # keep apps rated above min_rating (strict comparison) with at least min_reviews reviews
    top_apps = filtered_apps_reviews.loc[(filtered_apps_reviews['Rating'] > min_rating) & (filtered_apps_reviews['Reviews'] >= min_reviews), :]

    # sort by rating, then by number of reviews (both descending), and reset the index
    top_apps = top_apps.sort_values(by=['Rating', 'Reviews'], ascending=False).reset_index(drop=True)

    # saving the top apps dataframe as a csv file
    top_apps.to_csv("top_apps.csv")

    print(f"The transformed dataframe, which include {top_apps.shape[0]} rows and {top_apps.shape[1]} columns")

    return top_apps

In [21]:
# let's test the transform function
top_app_data = transform(apps=app_data, reviews=review_data, min_rating=4.0, min_reviews=1000, category='FOOD_AND_DRINK')

top_app_data

Let's transform the data to curate a dataset with all FOOD_AND_DRINK apps and their corresponding reviews with a rating of at least 4.0 and 1000 reviews
The transformed dataframe, which include 54 rows and 5 columns


Unnamed: 0,App,Rating,Reviews,Installs,Sentiment_Polarity
0,SarashpazPapion (Cooking with Chef Bowls),4.8,1250,"50,000+",
1,Domino's Pizza USA,4.7,1032935,"10,000,000+",0.226971
2,Tastely,4.7,611136,"10,000,000+",
3,Delicious Recipes,4.7,129737,"1,000,000+",
4,BeyondMenu Food Delivery,4.7,51517,"1,000,000+",0.408743
5,Recipes Pastries and homemade pies More than 5...,4.7,14065,"500,000+",
6,Pastry & Cooking (Without Net),4.7,6118,"1,000,000+",
7,Simple Recipes,4.7,3803,"500,000+",
8,Easy Recipes,4.7,2707,"100,000+",0.284777
9,OpenTable: Restaurants Near Me,4.6,90242,"5,000,000+",
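
To check the transform steps on data small enough to verify by hand, the sketch below repeats them on a toy dataset. All app names and values are invented for illustration. Note that this sketch uses `>=` for the rating threshold, whereas the transform function above uses a strict `>`.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration
apps = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Category": ["FOOD_AND_DRINK", "FOOD_AND_DRINK", "FOOD_AND_DRINK", "GAME"],
    "Rating": [4.5, 4.5, 3.8, 4.9],
    "Reviews": ["2000", "2000", "5000", "9000"],
})
reviews = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Sentiment_Polarity": [0.2, 0.6, -0.1],
})

# Drop duplicate apps, then keep a single category
apps = apps.drop_duplicates(["App"])
in_category = apps.loc[apps["Category"] == "FOOD_AND_DRINK"]

# Average sentiment polarity per app, indexed by app name
avg_polarity = reviews.groupby("App")["Sentiment_Polarity"].mean()

# A left join keeps apps without reviews (their polarity becomes NaN)
combined = in_category.join(avg_polarity, on="App", how="left")
combined = combined.assign(Reviews=combined["Reviews"].astype(int))

# Keep well-rated, frequently reviewed apps, best first
result = (
    combined.loc[(combined["Rating"] >= 4.0) & (combined["Reviews"] >= 1000)]
    .sort_values(by=["Rating", "Reviews"], ascending=False)
    .reset_index(drop=True)
)
print(result[["App", "Rating", "Reviews", "Sentiment_Polarity"]])
```

Only app A should survive here: B is rated below 4.0 and C belongs to a different category.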
