## **Play Store Apps Analysis**
Imagine you're diving into a big dataset from the Google Play Store. You'll be using Python to explore it and find interesting insights.

First, the data will be cleaned using Python: missing values, duplicate rows, and malformed records are fixed or removed.

In the wider workflow, SQL is then used on the clean data to find patterns and correlations, and charts and graphs make the trends easier to see. In this project, however, we are only going to work on the cleaning tasks.

By the end, you'll have learned a lot about how to make smart decisions using data. This project helps everyone understand how to use data to make better choices in our digital world.

Join this adventure as you explore the Google Play Store dataset. Get ready to find out all sorts of interesting things about your favorite apps!

### **Install the required libraries**
**langchain**: An open source framework for building applications based on large language models (LLMs).

**langchain_experimental**: An experimental extension of the langchain library. It provides the `create_pandas_dataframe_agent` function used throughout this project.

**openai**: The official Python client library for the OpenAI API. It gives access to OpenAI's hosted language models for tasks such as text generation, text classification, and language translation.

In [4]:
!pip install langchain
!pip install langchain_experimental
!pip install openai



### Task 1: Data Preparation: Importing Libraries, Setting up API Key, and Reading Data
Before we start our analysis, let's ensure we have everything we need. First, we'll import the required libraries. Then, we'll set up our OpenAI API key for access. Finally, we'll read the data from our CSV file and get everything ready.




**To obtain an API key for OpenAI, follow these steps:**

* Log in to your OpenAI account at the [OpenAI Login Page](https://platform.openai.com/login?launch).

* Generate and retrieve your API key on the [API Key Generation Page](https://platform.openai.com/api-keys).

Download the [Playstore Datasets](https://drive.google.com/drive/folders/1l_TabQZpTRypVKwVcHoD98YSAyXZOu91?usp=sharing).



In [5]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [6]:
import os
import pandas as pd
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain.llms import OpenAI

# Set up the API key (paste your own key between the quotes)
os.environ['OPENAI_API_KEY'] = ""

#Load the dataset (playstore_apps.csv) and assign it to a variable 'df'.

inputfile = "/content/playstore_apps.csv"

df = pd.read_csv(inputfile)
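
Pasting the key into a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is either already present in the environment or typed in at a hidden prompt. `resolve_api_key` is a hypothetical helper written for this illustration; it is not part of `openai` or `langchain`.

```python
import os
from getpass import getpass


def resolve_api_key(env, prompt=getpass):
    """Return the key stored in `env`, or ask for it without echoing it."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Fall back to an interactive, non-echoing prompt.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


# In the notebook this could be used as:
#   os.environ["OPENAI_API_KEY"] = resolve_api_key(os.environ)
```

Passing the environment and the prompt function in as arguments keeps the helper easy to check without a real key.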


### Task 2: Display Top 5 Records.

Great! Our next step is to initialize the agent, backed by an OpenAI model. Once that's done, we'll write a prompt to display the first 5 records of our data table.

In [7]:
# Initializing the agent
agent = create_pandas_dataframe_agent(OpenAI(temperature=0),
              df, verbose=True)

# write your prompt using the agent

agent("Display Top 5 records of dataframe df")

> Entering new AgentExecutor chain...
Thought: I need to use the head() function to display the first 5 rows of the dataframe
Action: python_repl_ast
Action Input: print(df.head())
                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1
1                                Coloring book moana  ART_AND_DESIGN     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3

    Reviews  Size    Installs  Type  Price Content Rating  \
0     159.0   19M     10000.0  Free    0.0       Everyone
1     967.0   14M    500000.0  Free    0.0       Everyone
2   87510.0  8.7M   5000000.0  Free    0.0       Everyone
3  215644.0   25M  50000000.0  Free    0.0           Te

{'input': 'Display Top 5 records of dataframe df',
 'output': 'The top 5 records of the dataframe df are displayed above.'}
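
The agent's thought shows that it ends up calling `df.head()`. The same call can be made directly in pandas, which is a quick way to cross-check the agent. The sketch below is only illustrative: `sample` is an assumed name, and it holds just the three columns of the five rows printed above.

```python
import pandas as pd

# Three columns of the first five rows, copied from the agent output above.
sample = pd.DataFrame({
    "App": [
        "Photo Editor & Candy Camera & Grid & ScrapBook",
        "Coloring book moana",
        "U Launcher Lite – FREE Live Cool Themes, Hide ...",
        "Sketch - Draw & Paint",
        "Pixel Draw - Number Art Coloring Book",
    ],
    "Category": ["ART_AND_DESIGN"] * 5,
    "Rating": [4.1, 3.9, 4.7, 4.5, 4.3],
})

# head(n) returns the first n rows; without an argument n defaults to 5.
top3 = sample.head(3)
print(top3)
```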

### Task 3: Determine the Shape (Count Rows and Columns) of the Dataset.
Great! For our next step, let's create a prompt to gather information about the shape of our playstore app dataset. This will help us understand how many rows and columns it contains.

In [8]:
df.shape

(10841, 13)

In [9]:
# write your prompt using the agent
agent("Tell the shape of the Dataframe df")

> Entering new AgentExecutor chain...
Thought: I need to use the shape attribute of the dataframe
Action: python_repl_ast
Action Input: df.shape
(10841, 13)
10841 rows and 13 columns
Final Answer: The shape of the dataframe df is (10841, 13).

> Finished chain.


{'input': 'Tell the shape of the Dataframe df',
 'output': 'The shape of the dataframe df is (10841, 13).'}

### Task 4: Finding the Missing data.
Wow! We've successfully found out the shape of the dataset. Now, let's create a prompt to ask our agent to help us find details about any missing information in the data table.

In [10]:
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              1
Size                 0
Installs             1
Type                 1
Price                1
Content Rating       1
Genres               0
Last Updated         1
Current Ver          8
Android Ver          3
dtype: int64

In [11]:
# write your prompt using the agent
agent("Let us know the count of missing data present in the dataframe df")

> Entering new AgentExecutor chain...
Thought: I need to find a way to count the number of missing values in the dataframe
Action: python_repl_ast
Action Input: df.isnull().sum()
App                  0
Category             0
Rating            1474
Reviews              1
Size                 0
Installs             1
Type                 1
Price                1
Content Rating       1
Genres               0
Last Updated         1
Current Ver          8
Android Ver          3
dtype: int64
I now know the final answer
Final Answer: There are a total of 1491 missing values in the dataframe df.

> Finished chain.


{'input': 'Let us know the count of missing data present in the dataframe df',
 'output': 'There are a total of 1491 missing values in the dataframe df.'}
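
The agent's total of 1491 can be checked by adding up the per-column counts. The sketch below rebuilds those counts as a Series (the variable names are assumptions) and sums them. On the real frame the equivalent one-liner would be `df.isnull().sum().sum()`.

```python
import pandas as pd

# Per-column missing counts copied from the df.isnull().sum() output above.
missing_per_column = pd.Series({
    "App": 0, "Category": 0, "Rating": 1474, "Reviews": 1, "Size": 0,
    "Installs": 1, "Type": 1, "Price": 1, "Content Rating": 1,
    "Genres": 0, "Last Updated": 1, "Current Ver": 8, "Android Ver": 3,
})

# Grand total across all columns.
total_missing = int(missing_per_column.sum())

# Only the columns that actually have gaps, largest first.
columns_with_gaps = missing_per_column[missing_per_column > 0].sort_values(ascending=False)

print(total_missing)
print(columns_with_gaps)
```

Almost all of the gaps sit in `Rating`, which is why that column gets the most attention in the later filling step.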

### Task 5: Identifying Duplicate Rows.
Fantastic! We've successfully found the missing values. Now, let's create a prompt to ask our agent to count the number of duplicate rows in the playstore app dataset.

In [12]:
df.duplicated().sum()

483

In [13]:
# write your prompt using the agent

agent("Display the count of Duplicate rows in dataframe df")

> Entering new AgentExecutor chain...
Thought: I need to use the pandas dataframe method 'duplicated' to find duplicate rows
Action: python_repl_ast
Action Input: df.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...
10836    False
10837    False
10838    False
10839    False
10840    False
Length: 10841, dtype: bool
Now I need to use the pandas dataframe method 'sum' to count the number of True values in the duplicated column
Action: python_repl_ast
Action Input: df.duplicated().sum()
483
483 duplicate rows were found in the dataframe
Final Answer: 483

> Finished chain.


{'input': 'Display the count of Duplicate rows in dataframe df',
 'output': '483'}
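
It helps to know what `duplicated()` actually counts. By default it flags only the later copies of a row, so the count equals the number of rows that `drop_duplicates()` would remove. A small sketch on made-up data (the frame `toy` is purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B", "A", "A", "C"],
    "Rating": [4.0, 3.5, 4.0, 4.0, 4.8],
})

# Default keep='first': the first occurrence is not flagged, later copies are.
repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, including the first.
all_copies = int(toy.duplicated(keep=False).sum())

print(repeats, all_copies)
```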

### Task 6: Dropping Unnecessary Columns.
Fantastic! We successfully found the duplicates. Now, let's create a prompt to ask our agent to remove the Last Updated, Current Ver, Android Ver and Size columns from the playstore dataset.

In [14]:
# write your prompt using the agent
agent("Please drop the columns 'Last Updated', 'Current Ver', 'Android Ver' and 'Size' from df and update df")

> Entering new AgentExecutor chain...
Thought: I need to use the drop function to remove the specified columns and then reassign the updated dataframe to df
Action: python_repl_ast
Action Input: df.drop(['Last Updated', 'Current Ver', 'Android Ver', 'Size'], axis=1, inplace=True)
I now know the final answer
Final Answer: The updated dataframe df with the specified columns dropped.

> Finished chain.


{'input': "Please drop the columns 'Last Updated', 'Current Ver', 'Android Ver' and 'Size' from df and update df",
 'output': 'The updated dataframe df with the specified columns dropped.'}

In [15]:
df.shape

(10841, 9)
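
The agent used `drop(..., axis=1, inplace=True)`. Running that cell a second time would raise a `KeyError`, because the columns are already gone. A hedged alternative is to return a new frame and pass `errors="ignore"`, so the step can be repeated safely. The frame `toy` and its placeholder values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "B"],
    "Size": ["x", "x"],
    "Last Updated": ["x", "x"],
    "Current Ver": ["x", "x"],
    "Android Ver": ["x", "x"],
})

to_drop = ["Last Updated", "Current Ver", "Android Ver", "Size"]

# Returns a new frame; `toy` itself is left untouched.
trimmed = toy.drop(columns=to_drop)

# errors="ignore" skips labels that are no longer present instead of raising.
trimmed_again = trimmed.drop(columns=to_drop, errors="ignore")

print(trimmed.columns.tolist())
```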

### Task 7: Removing Duplicate Rows.
Great progress! We successfully dropped the columns. Now, let's create a prompt to ask our agent to remove duplicate rows from the playstore dataset.

In [16]:
# write your prompt using the agent
agent("""Remove the the duplicate rows in the dataframe by keeping the first occurence
         Ensure the changes are updated to dataframe df """)

> Entering new AgentExecutor chain...
Thought: I need to use the drop_duplicates() function to remove duplicate rows, and set the inplace parameter to True to update the dataframe.
Action: python_repl_ast
Action Input: df.drop_duplicates(inplace=True)
I now know the final answer
Final Answer: The duplicate rows have been removed from the dataframe df.

> Finished chain.


{'input': 'Remove the the duplicate rows in the dataframe by keeping the first occurence\n         Ensure the changes are updated to dataframe df ',
 'output': 'The duplicate rows have been removed from the dataframe df.'}

In [17]:
agent("What is the shape of the dataframe df")

> Entering new AgentExecutor chain...
Thought: I need to use the shape attribute of the dataframe
Action: python_repl_ast
Action Input: df.shape
(10356, 9)
10356 rows and 9 columns
Final Answer: The shape of the dataframe df is (10356, 9).

> Finished chain.


{'input': 'What is the shape of the dataframe df',
 'output': 'The shape of the dataframe df is (10356, 9).'}

In [18]:
df.shape

(10356, 9)
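
The row count fell by 10841 - 10356 = 485, two more than the 483 duplicates counted in Task 5. A plausible explanation is that some rows differed only in one of the columns dropped in Task 6 and became identical once those columns were removed. The toy data below is made up to show the effect; it is not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Rating": [4.0, 4.0, 3.5, 3.5],
    "Current Ver": ["1.0", "1.0", "2.0", "2.1"],
})

# With all columns, only the two "A" rows are identical.
before = int(toy.duplicated().sum())

# Without "Current Ver", the two "B" rows become identical as well.
narrower = toy.drop(columns=["Current Ver"])
after = int(narrower.duplicated().sum())

# keep="first" retains the first row of each duplicate group.
deduped = narrower.drop_duplicates(keep="first")

print(before, after, len(deduped))
```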

### Task 8: Filling missing values in the DataFrame.
Good job! The duplicate rows have been successfully dropped. Let's now build a prompt asking our agent to fill missing values in the Rating, Reviews, Price and Installs columns with each column's mean, fill missing values in the Type column with the string 'Free', and fill missing values in the Content Rating column with the string 'Everyone'.

In [19]:
agent("Dispaly the mean values of the columns Rating, Reviews, Price and Installs columns")

> Entering new AgentExecutor chain...
Thought: I need to use the mean() function on the specified columns
Action: python_repl_ast
Action Input: df[['Rating', 'Reviews', 'Price', 'Installs']].mean()
Rating      4.189551e+00
Reviews     4.059634e+05
Price       1.031000e+00
Installs    1.415856e+07
dtype: float64
I now know the final answer
Final Answer: Rating      4.189551e+00
Reviews     4.059634e+05
Price       1.031000e+00
Installs    1.415856e+07
dtype: float64

> Finished chain.


{'input': 'Dispaly the mean values of the columns Rating, Reviews, Price and Installs columns',
 'output': 'Rating      4.189551e+00\nReviews     4.059634e+05\nPrice       1.031000e+00\nInstalls    1.415856e+07\ndtype: float64'}

In [20]:
# write your prompt using the agent
agent(""" Make the following changes in the dataframe df as per the below instructions and update the dataframe df

- calculate the mean of non-null values in the Rating column round it off to two decimals and replace the null values in the Rating values column to its mean.
- calculate the mean of non-null values in the Reviews column round it off to two decimals and replace the null values in the Reviews values column to its mean.
- calculate the mean of non-null values in the Price column round it off to two decimals and replace the null values in the Price values column to its mean.
- calculate the mean of non-null values in the Installs column round it off to two decimals and replace the null values in the Installs values column to its mean.
- replace the null values in the column 'Type' by 'Free'
-  replace the null values in the column 'Content Rating' by 'Everyone'
Make sure changes are updated in the dataframe df
""")

> Entering new AgentExecutor chain...
Thought: I need to use the mean function to calculate the mean of non-null values in each column and then use the fillna function to replace the null values with the calculated mean. I also need to use the replace function to replace null values in the 'Type' and 'Content Rating' columns.
Action: python_repl_ast
Action Input: df['Rating'].mean()
4.189551231582499
I need to round off the mean to two decimals.
Action: python_repl_ast
Action Input: round(df['Rating'].mean(), 2)
4.19
I need to use the fillna function to replace null values in the 'Rating' column with the calculated mean.
Action: python_repl_ast
Action Input: df['Rating'].fillna(round(df['Rating'].mean(), 2), inplace=True)
I need to repeat the above steps for the 'Reviews', 'Price', and 'Installs' columns.
Action: python_repl_ast
Action Input: df['Reviews'].mean()

{'input': " Make the following changes in the dataframe df as per the below instructions and update the dataframe df\n\n- calculate the mean of non-null values in the Rating column round it off to two decimals and replace the null values in the Rating values column to its mean.\n- calculate the mean of non-null values in the Reviews column round it off to two decimals and replace the null values in the Reviews values column to its mean.\n- calculate the mean of non-null values in the Price column round it off to two decimals and replace the null values in the Price values column to its mean.\n- calculate the mean of non-null values in the Installs column round it off to two decimals and replace the null values in the Installs values column to its mean.\n- replace the null values in the column 'Type' by 'Free'\n-  replace the null values in the column 'Content Rating' by 'Everyone'\nMake sure changes are updated in the dataframe df\n",
 'output': 'Agent stopped due to iteration limit or

In [21]:
df.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Installs          0
Type              0
Price             0
Content Rating    1
Genres            0
dtype: int64
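
The agent worked through the columns one call at a time and stopped before finishing, which is why `Content Rating` still shows one null above. In plain pandas the whole instruction fits into a single `fillna` call that takes a dictionary of per-column fill values. This is a sketch on made-up data, not the agent's own code; `toy`, `numeric_cols`, and `fill_values` are assumed names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 4.5],
    "Reviews": [100.0, 200.0, np.nan],
    "Type": ["Free", None, "Paid"],
    "Content Rating": ["Everyone", "Teen", None],
})

numeric_cols = ["Rating", "Reviews"]

# mean() skips missing values by default; round to two decimals as in the prompt.
fill_values = toy[numeric_cols].mean().round(2).to_dict()

# Constant replacements for the text columns.
fill_values.update({"Type": "Free", "Content Rating": "Everyone"})

# One call fills every listed column with its own value.
filled = toy.fillna(value=fill_values)

print(filled)
```

On the real frame the numeric list would also include `Price` and `Installs`.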

In [22]:
df[df['Rating'] == 4.19]

| | App | Category | Rating | Reviews | Installs | Type | Price | Content Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|
| 23 | Mcqueen Coloring pages | ART_AND_DESIGN | 4.19 | 61.0 | 100000.0 | Free | 0.0 | Everyone | Art & Design;Action & Adventure |
| 113 | Wrinkles and rejuvenation | BEAUTY | 4.19 | 182.0 | 100000.0 | Free | 0.0 | Everyone 10+ | Beauty |
| 123 | Manicure - nail design | BEAUTY | 4.19 | 119.0 | 50000.0 | Free | 0.0 | Everyone | Beauty |
| 126 | Skin Care and Natural Beauty | BEAUTY | 4.19 | 654.0 | 100000.0 | Free | 0.0 | Teen | Beauty |
| 129 | Secrets of beauty, youth and health | BEAUTY | 4.19 | 77.0 | 10000.0 | Free | 0.0 | Mature 17+ | Beauty |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10824 | Cardio-FR | MEDICAL | 4.19 | 67.0 | 10000.0 | Free | 0.0 | Everyone | Medical |
| 10825 | Naruto & Boruto FR | SOCIAL | 4.19 | 7.0 | 100.0 | Free | 0.0 | Teen | Social |
| 10831 | payermonstationnement.fr | MAPS_AND_NAVIGATION | 4.19 | 38.0 | 5000.0 | Free | 0.0 | Everyone | Maps & Navigation |
| 10835 | FR Forms | BUSINESS | 4.19 | 0.0 | 10.0 | Free | 0.0 | Everyone | Business |


### Task 9: Eliminating Data Anomalies.

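Wow! The missing values are filled, apart from a single null left in Content Rating. Now, let's develop a prompt to tell our agent to remove the rows with Category equal to '1.9' from the DataFrame. As the lookup below shows, this is a malformed row whose values sit in the wrong columns, and it is also the row that holds the remaining null.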

In [23]:
df[df['Category'] == '1.9']

| | App | Category | Rating | Reviews | Installs | Type | Price | Content Rating | Genres |
|---|---|---|---|---|---|---|---|---|---|
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 405963.44 | 14158562.48 | 0 | 1.03 | | February 11, 2018 |


In [24]:
# write your prompt using the agent
agent("""Drop rows from dataframe df where the value in column 'Category' is '1.9' .
Ensure the changes are updated in the dataframe df""")

> Entering new AgentExecutor chain...
Thought: I need to use the drop function to remove rows from the dataframe.
Action: python_repl_ast
Action Input: df.drop(df[df['Category'] == '1.9'].index, inplace=True)
I need to specify the column 'Category' and the value '1.9' to filter the rows I want to drop.
Action: python_repl_ast
Action Input: df[df['Category'] == '1.9'].index
Int64Index([], dtype='int64')
This shows that there are no rows with the value '1.9' in the 'Category' column.
Action: python_repl_ast
Action Input: df.drop(df[df['Category'] == '1.9'].index, inplace=True)
The inplace=True parameter ensures that the changes are updated in the original dataframe.
Action: python_repl_ast
Action Input: print(df.head())
                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook

{'input': "Drop rows from dataframe df where the value in column 'Category' is '1.9' .\nEnsure the changes are updated in the dataframe df",
 'output': "The rows with the value '1.9' in the 'Category' column have been successfully dropped from the dataframe."}

In [25]:
df[df['Category'] == '1.9']

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres
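
The same removal can be written as a boolean mask, which avoids looking up index labels. The sketch below uses made-up rows. The second check, which flags any category made only of digits and dots, is a hypothetical generalization and not something the original notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["Sketch", "Shifted row", "Pixel Draw"],
    "Category": ["ART_AND_DESIGN", "1.9", "ART_AND_DESIGN"],
})

# Exact match on the known bad value.
known_bad = toy["Category"] == "1.9"

# Keep everything that is not flagged; copy() gives an independent frame.
cleaned = toy[~known_bad].copy()

# Hypothetical broader check: categories consisting only of digits and dots.
looks_numeric = toy["Category"].str.match(r"^[\d.]+$", na=False)

print(cleaned)
```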


### Task 10: Check App Distribution.

Fantastic! We've successfully removed the anomalies. Now, let's construct a prompt that tells our agent to count occurrences of each unique value in the Type column.

In [26]:
df['Type'].value_counts()

Free    9590
Paid     765
Name: Type, dtype: int64

In [27]:
# write your prompt using the agent
agent("Tell us count the occurrences of each unique value Type  column in the dataframe df ")

> Entering new AgentExecutor chain...
Thought: I need to use the value_counts() function to count the occurrences of each unique value in the Type column.
Action: python_repl_ast
Action Input: df['Type'].value_counts()
Free    9590
Paid     765
Name: Type, dtype: int64
I now know the final answer
Final Answer: The count of occurrences of each unique value in the Type column is 9590 for Free and 765 for Paid.

> Finished chain.


{'input': 'Tell us count the occurrences of each unique value Type  column in the dataframe df ',
 'output': 'The count of occurrences of each unique value in the Type column is 9590 for Free and 765 for Paid.'}
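
Raw counts are easier to interpret as shares. The sketch below converts the two counts reported above into percentages. On the real frame, `df['Type'].value_counts(normalize=True)` would give the same proportions directly. The variable names are assumptions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
type_counts = pd.Series({"Free": 9590, "Paid": 765})

# Share of each type in percent, rounded to two decimals.
type_share = (type_counts / type_counts.sum() * 100).round(2)

print(type_share)
```

Roughly 93 in every 100 apps in the cleaned data are free.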

### Task 11:  Cleaning App Name.
Amazing! Let's now develop a prompt that tells our agent to identify the rows where the App column starts with '?' and remove those rows from the 'df' DataFrame.

In [28]:
# write your prompt using the agent

agent("""Identify the rows that has values starting with '?' for the 'App' column and Delete them from the dataframe df.
Make sure the changes are reflected in the dataframe df
""")

> Entering new AgentExecutor chain...
Thought: I need to use the pandas dataframe method 'loc' to select the rows that meet the condition and then use the 'drop' method to delete them.
Action: python_repl_ast
Action Input: df.loc[df['App'].str.startswith('?')]
                                                     App             Category  \
313                  ?? Comics - Manga,Novel and Stories               COMICS
321    ?Ranobbe complete free? Novelba - Free app tha...               COMICS
882    ?? Football Wallpapers 4K | Full HD Background...        ENTERTAINMENT
1081                                               ??Cep              FINANCE
1488   ???: Buying a house, selling a house, renting ...       HOUSE_AND_HOME
1592              ????? Astrology - Min Thein Kha BayDin            LIFESTYLE
2575      ?? WhatsLov: Smileys of love, stickers and GIF               SOCIAL
3750                                         ??? ???

{'input': "Identify the rows that has values starting with '?' for the 'App' column and Delete them from the dataframe df.\nMake sure the changes are reflected in the dataframe df\n",
 'output': "The rows that have values starting with '?' for the 'App' column have been successfully deleted from the dataframe df."}

In [29]:
df.loc[df['App'].str.startswith('?')]

Unnamed: 0,App,Category,Rating,Reviews,Installs,Type,Price,Content Rating,Genres
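
`str.startswith` treats its argument as a literal string, so `'?'` needs no escaping here (a regular-expression method such as `str.contains` would need `\?`). A minimal sketch on made-up names. Passing `na=False` is an added precaution so that a missing app name counts as "does not match" rather than producing a missing value in the mask.

```python
import pandas as pd

toy = pd.DataFrame({"App": ["?? Comics", "Sketch", None, "What? App"]})

# Literal prefix test; missing names are treated as non-matching.
starts_with_q = toy["App"].str.startswith("?", na=False)

# Keep only the rows that do not start with '?'.
cleaned = toy[~starts_with_q]

print(cleaned)
```

Note that "What? App" is kept, because the question mark is not at the start.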


In [30]:
# export the cleaned dataframe 'df' with the name 'cleaned_apps_db.csv' using the parameter index=True.

agent("Export all the data from dataframe df to cleaned_apps_db.csv file with an index ")

> Entering new AgentExecutor chain...
Thought: I need to use the pandas to_csv() function to export the dataframe to a csv file
Action: python_repl_ast
Action Input: df.to_csv('cleaned_apps_db.csv', index=True)
I now know the final answer
Final Answer: The data from dataframe df has been successfully exported to the cleaned_apps_db.csv file with an index.

> Finished chain.


{'input': 'Export all the data from dataframe df to cleaned_apps_db.csv file with an index ',
 'output': 'The data from dataframe df has been successfully exported to the cleaned_apps_db.csv file with an index.'}
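
Writing with `index=True` stores the row labels as an unnamed first column. When the file is read back without further options, that column reappears as `Unnamed: 0`; passing `index_col=0` restores it as the index. The sketch below demonstrates the round trip in memory with `io.StringIO` and made-up rows, so no file is written.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"App": ["Sketch", "Pixel Draw"], "Rating": [4.5, 4.3]},
    index=[3, 4],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
csv_text = buffer.getvalue()

# Reading without options turns the saved index into a data column.
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas that the first column holds the row labels.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(restored.columns.tolist())
```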

### Task 12: Display Top 5 Records for playstore_reviews.

Great! Our next step is to initialize a second agent, again backed by an OpenAI model, for the reviews data. Once that's done, we'll create a prompt to display the first 5 records from our playstore_reviews table.

In [51]:
#Load the dataset (playstore_reviews.csv) and assign it to a variable 'df1'.
inputfile = "/content/playstore_reviews.csv"

df1 = pd.read_csv(inputfile)
# Initializing the agent1.
agent1 = create_pandas_dataframe_agent(OpenAI(temperature=0),
              df1, verbose=True)

# write your prompt using the agent1
agent1("Dispaly the top 5 records of df1")

> Entering new AgentExecutor chain...
Thought: I need to use the head() function to display the top 5 records.
Action: python_repl_ast
Action Input: print(df.head())
                     App                                  Translated_Review  \
0  10 Best Foods for You  I like eat delicious food. That's I'm cooking ...
1  10 Best Foods for You    This help eating healthy exercise regular basis
2  10 Best Foods for You                                                NaN
3  10 Best Foods for You         Works great especially going grocery store
4  10 Best Foods for You                                       Best idea us

  Sentiment  Sentiment_Polarity  Sentiment_Subjectivity
0  Positive                1.00                0.533333
1  Positive                0.25                0.288462
2       NaN                 NaN                     NaN
3  Positive                0.40                0.875000
4  Positive                1.0

{'input': 'Dispaly the top 5 records of df1',
 'output': 'The top 5 records of df1 are displayed.'}

### Task 13: Determine the Shape (Count Rows and Columns) of the Dataset.
Great! For our next step, let's create a prompt to gather information about the shape of our playstore_reviews dataset. This will help us understand how many rows and columns it contains.

In [32]:
# write your prompt using the agent1
agent1("Let us know the shape of the df1")

> Entering new AgentExecutor chain...
Thought: I need to use the shape attribute of the dataframe
Action: python_repl_ast
Action Input: df.shape
(64295, 5)
I now know the final answer
Final Answer: The shape of the df1 is (64295, 5).

> Finished chain.


{'input': 'Let us know the shape of the df1',
 'output': 'The shape of the df1 is (64295, 5).'}

### Task 14: Finding the Missing data.
Wow! We've successfully found out the shape of the dataset. Now, let's create a prompt to ask our agent to help us find details about any missing information in the playstore_reviews table.

In [52]:
# write your prompt using the agent1
agent1("Display the count of missing values of df1")

> Entering new AgentExecutor chain...
Thought: I need to use a pandas function to count the missing values in the dataframe
Action: python_repl_ast
Action Input: df.isnull().sum()
App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64
I now know the final answer
Final Answer: The count of missing values in df1 is 26868 for the "Translated_Review" column and 26863 for the "Sentiment", "Sentiment_Polarity", and "Sentiment_Subjectivity" columns.

> Finished chain.


{'input': 'Display the count of missing values of df1',
 'output': 'The count of missing values in df1 is 26868 for the "Translated_Review" column and 26863 for the "Sentiment", "Sentiment_Polarity", and "Sentiment_Subjectivity" columns.'}

### Task 15: Identifying Duplicate Rows.
Fantastic! We've successfully found the missing values. Now, let's create a prompt to ask our agent to count the number of duplicate rows in the playstore_reviews dataset.

In [53]:
# write your prompt using the agent1
agent1("Display the count of duplicate rows of the dataset")

> Entering new AgentExecutor chain...
Thought: I need to use the pandas dataframe method "duplicated()" to find duplicate rows, and then use the "sum()" method to count them.
Action: python_repl_ast
Action Input: df.duplicated().sum()
33616
I now know the final answer
Final Answer: 33616

> Finished chain.


{'input': 'Display the count of duplicate rows of the dataset',
 'output': '33616'}

### Task 16: Removing Missing values.
Incredible! We successfully identified the duplicates. Now, let's create a prompt to ask our agent to drop rows with missing values from the playstore_reviews dataset.

In [55]:
# write your prompt using the agent1
agent1("""Remove the rows with null values from dataframe .
        Makes sure the changes are updated to dataframe""")

> Entering new AgentExecutor chain...
Thought: I need to use the dropna() function to remove rows with null values and set the inplace parameter to True to update the dataframe.
Action: python_repl_ast
Action Input: df.dropna(inplace=True)
The rows with null values have been removed and the changes have been updated to the dataframe.
Final Answer: df.dropna(inplace=True)

> Finished chain.


{'input': 'Remove the rows with null values from dataframe .\n        Makes sure the changes are updated to dataframe',
 'output': 'df.dropna(inplace=True)'}

In [56]:
agent1("Find the rows which has missing values dataframe df1")

> Entering new AgentExecutor chain...
Thought: I need to use a method to check for missing values in a dataframe
Action: python_repl_ast
Action Input: df.isnull()
         App  Translated_Review  Sentiment  Sentiment_Polarity  \
0      False              False      False               False
1      False              False      False               False
3      False              False      False               False
4      False              False      False               False
5      False              False      False               False
...      ...                ...        ...                 ...
64222  False              False      False               False
64223  False              False      False               False
64226  False              False      False               False
64227  False              False      False               False
64230  False              False      False               False

{'input': 'Find the rows which has missing values dataframe df1',
 'output': 'There are no rows with missing values in the dataframe df.'}

In [57]:
df1.shape

(37427, 5)
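
The row count went from 64295 to 37427, a difference of 26868, which equals the missing count reported for `Translated_Review` in Task 14. That is consistent with every row that lacks a sentiment value also lacking a review text. The sketch below reproduces the pattern on made-up rows and compares the rows removed by `dropna()` with the rows that have at least one gap.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Works great", np.nan, np.nan, "Best idea us"],
    "Sentiment": ["Positive", np.nan, "Positive", "Positive"],
})

# dropna() with default settings removes a row if any column is missing.
complete = toy.dropna()
rows_removed = len(toy) - len(complete)

# Same number, computed as "rows with at least one missing value".
any_missing = int(toy.isnull().any(axis=1).sum())

print(rows_removed, any_missing)
```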

### Task 17: Removing Duplicate Rows.
Great progress! We successfully dropped the rows with missing values. Now, let's create a prompt to ask our agent to remove duplicate rows from the playstore_reviews dataset.

In [58]:
# write your prompt using the agent1

agent1("""Remove the duplicate rows from the dataframe by keeping the first occurence.
 Make sure that changes are updated to dataframe """)

> Entering new AgentExecutor chain...
Thought: I need to use the drop_duplicates() function from pandas to remove the duplicate rows.
Action: python_repl_ast
Action Input: df.drop_duplicates(keep='first', inplace=True)
I now know the final answer
Final Answer: df.drop_duplicates(keep='first', inplace=True)

> Finished chain.


{'input': 'Remove the duplicate rows from the dataframe by keeping the first occurence.\n Make sure that changes are updated to dataframe ',
 'output': "df.drop_duplicates(keep='first', inplace=True)"}

In [59]:
df1.shape

(29692, 5)
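
Task 15 counted 33616 duplicates on the raw data, yet only 37427 - 29692 = 7735 rows were removed here. That gap is expected if most of the earlier duplicates were rows with empty review fields, which Task 16 had already dropped: pandas treats missing values in the same position as equal when it compares rows. The made-up example below shows the effect and chains both cleaning steps.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "App": ["A", "A", "A", "A", "B"],
    "Translated_Review": ["Good", "Good", np.nan, np.nan, "Nice"],
    "Sentiment": ["Positive", "Positive", np.nan, np.nan, "Positive"],
})

# On the raw data, the repeated empty row counts as a duplicate too.
dupes_raw = int(toy.duplicated().sum())

# After removing incomplete rows, fewer duplicates remain.
dupes_after_dropna = int(toy.dropna().duplicated().sum())

# Both cleaning steps chained, with a fresh 0..n-1 index at the end.
cleaned = toy.dropna().drop_duplicates(keep="first").reset_index(drop=True)

print(dupes_raw, dupes_after_dropna, len(cleaned))
```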

In [60]:
# export the cleaned dataframe 'df1' with the name 'cleaned_reviews_db.csv' using the parameter index=True.

agent1("Export all the data from dataframe df to cleaned_reviews_db.csv file with an index ")

> Entering new AgentExecutor chain...
Thought: I need to use the pandas to_csv function to export the dataframe to a csv file
Action: python_repl_ast
Action Input: df.to_csv('cleaned_reviews_db.csv', index=True)
I need to check if the file was successfully created
Action: python_repl_ast
Action Input: import os
I can use the os.path.exists function to check if the file exists
Action: python_repl_ast
Action Input: os.path.exists('cleaned_reviews_db.csv')
True
I now know the final answer
Final Answer: The data from dataframe df was successfully exported to the cleaned_reviews_db.csv file with an index.

> Finished chain.


{'input': 'Export all the data from dataframe df to cleaned_reviews_db.csv file with an index ',
 'output': 'The data from dataframe df was successfully exported to the cleaned_reviews_db.csv file with an index.'}