![image.jpeg](PlayStore.jpeg)

# **Data Cleaning and Analysis Activities**

**`Note:` It's better to make a copy of the dataframe and test on it before making any changes to the original one.**
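
A minimal sketch of that workflow, using a small made-up frame in place of the Play Store data (the name `df_test` and the values are only illustrative). `DataFrame.copy()` returns an independent copy, so edits to it leave `df` untouched.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration
df = pd.DataFrame({"App": ["A", "B"], "Rating": [4.5, 19.0]})

# copy() creates an independent (deep) copy by default
df_test = df.copy()
df_test.loc[df_test["Rating"] > 5, "Rating"] = float("nan")

print(df["Rating"].tolist())            # the original values are unchanged
print(df_test["Rating"].isna().sum())   # the invalid rating was blanked only in the copy
```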

### 1. Which of the following column(s) has/have null values?

Select the columns that you have identified as having null/missing values.

In [None]:
df.isnull().sum()
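
To list only the columns that actually contain nulls, the counts can be filtered. This is a small sketch on a made-up frame (the name `toy` and its values are assumptions, not the real data):

```python
import pandas as pd

# Made-up frame with nulls in two of its three columns
toy = pd.DataFrame({
    "App": ["A", "B"],
    "Rating": [4.1, None],
    "Type": [None, "Free"],
})

null_counts = toy.isnull().sum()

# Keep only the columns whose null count is greater than zero
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```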

### 2. Clean the Rating Column and Other Columns Containing Null Values

**Steps:**

0. Try plotting a histogram and a boxplot for this column to understand the issue
1. Remove invalid values from the Rating column (set them as NaN).
2. Fill null values in the Rating column using the mean().
3. Clean any other non-numerical columns by dropping the rows containing null values.
4. Perform the modifications "in place", modifying `df`. If you make a mistake, re-load the data.

**Details:**
- Replace all ratings not in the range of 0 to 5 with NaN.
- Drop rows with null values in other columns.

In [None]:
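import numpy as np
df.loc[(df['Rating'] < 0) | (df['Rating'] > 5), 'Rating'] = np.nan  # Replace out-of-range ratings with NaN
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())  # Fill missing ratings with the mean of the valid ones
df.dropna(inplace=True)  # Drop rows that still have nulls in other columns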
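
The order of the steps matters: invalid ratings have to become NaN before the mean is computed, otherwise they would skew the fill value. A toy illustration with made-up ratings (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up ratings: 19.0 is out of range, None is missing
toy = pd.DataFrame({"Rating": [4.0, 19.0, None, 3.0]})

# Step 1: out-of-range values become NaN
toy.loc[(toy["Rating"] < 0) | (toy["Rating"] > 5), "Rating"] = np.nan

# Step 2: fill NaN with the mean of the remaining valid ratings
valid_mean = toy["Rating"].mean()  # mean() skips NaN by default
toy["Rating"] = toy["Rating"].fillna(valid_mean)

print(toy["Rating"].tolist())
```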

### 3. Clean the Reviews Column and Make It Numeric

You'll notice that some columns in this dataframe that should be numeric were parsed as object (string). That's because the numbers are sometimes written with an M or a k suffix to indicate mega (millions) or kilo (thousands).

Clean the Reviews column by transforming the values to the correct numeric representation.

For example, 5M should be 5000000.

In [None]:
# Convert the Reviews column to numeric
def reviews(value):
  if isinstance(value, str):  # Check if the value is a string
    if 'M' in value:  # Convert "M" (millions) to an integer
      return int(float(value.replace('M', '')) * 1000000)
    elif 'k' in value.lower():  # Convert "k" or "K" (thousands) to an integer
      return int(float(value.lower().replace('k', '')) * 1000)
  return int(value)  # Convert plain numeric values to int
df['Reviews'] = df['Reviews'].apply(reviews)

### 4. Count the Number of Duplicated Apps

Count the number of duplicated rows. That is, if the app Twitter appears 2 times, that counts as 2.

In [None]:
df.duplicated(subset=['App'], keep=False).sum()  # keep=False flags every copy, so an app appearing twice counts as 2
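
The result depends on the `keep` argument of `duplicated`. A short sketch on made-up app names shows the difference between counting only the extra copies and counting every repeated row:

```python
import pandas as pd

# Made-up data: Twitter appears twice, Notes three times, Maps once
toy = pd.DataFrame({"App": ["Twitter", "Twitter", "Maps", "Notes", "Notes", "Notes"]})

# Default keep='first': the first occurrence of each app is not flagged
extra_copies = toy.duplicated(subset=["App"]).sum()

# keep=False: every row of a repeated app is flagged
all_repeated_rows = toy.duplicated(subset=["App"], keep=False).sum()

print(extra_copies, all_repeated_rows)
```

With "if Twitter appears 2 times, that counts as 2" as the rule, the second count is the one that matches.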

### 5. Drop Duplicated Apps Keeping Only the Ones with the Greatest Number of Reviews

Now that the Reviews column is numeric, we can use it to clean duplicated apps. Drop duplicated apps, keeping just one copy of each, the one with the greatest number of reviews.

Hint: you'll need to sort the dataframe by App and Reviews, and that will change the order of your df.

In [None]:
df = df.sort_values(by=['App', 'Reviews'], ascending=[True, False])  # Within each app, the row with the most reviews comes first
df = df.drop_duplicates(subset=['App'], keep='first')  # Drop duplicate apps while keeping the one with the highest number of reviews
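
The sort-then-drop pattern can be checked on a tiny made-up frame. The `reset_index(drop=True)` call is an optional extra, added here only to tidy the index after the reordering:

```python
import pandas as pd

# Made-up data: Twitter appears twice with different review counts
toy = pd.DataFrame({
    "App": ["Twitter", "Maps", "Twitter"],
    "Reviews": [100, 50, 300],
})

# Sort by App (A to Z), then by Reviews (largest first) within each app
toy = toy.sort_values(by=["App", "Reviews"], ascending=[True, False])

# keep='first' now retains the row with the most reviews for each app
toy = toy.drop_duplicates(subset=["App"], keep="first").reset_index(drop=True)

print(toy.to_dict("records"))
```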

### 6. Format the Category Column

Categories are all uppercase, with words separated by underscores. Instead, we want only the first character capitalized and the underscores replaced with spaces.

For example, the category AUTO_AND_VEHICLES should be transformed to Auto and vehicles. Also, if you find any other wrong value, transform it into an Unknown category.

In [None]:
# Assumption: a valid raw category consists only of uppercase letters and underscores
is_valid = df['Category'].str.fullmatch(r'[A-Z_]+', na=False)
df['Category'] = df['Category'].str.replace('_', ' ', regex=False).str.capitalize()  # e.g. AUTO_AND_VEHICLES -> Auto and vehicles
df.loc[~is_valid, 'Category'] = 'Unknown'  # Replace any invalid categories with 'Unknown'
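
Note that `str.capitalize()` and `str.title()` give different results for multi-word categories. A quick comparison on two sample strings (the series name `cats` is illustrative):

```python
import pandas as pd

cats = pd.Series(["AUTO_AND_VEHICLES", "GAME"])
spaced = cats.str.replace("_", " ", regex=False)

capitalized = spaced.str.capitalize().tolist()  # only the first character is upper-cased
titled = spaced.str.title().tolist()            # the first letter of every word is upper-cased

print(capitalized)
print(titled)
```

Only the `capitalize()` result matches the expected form "Auto and vehicles" from the task description.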

### 7. Clean and Convert the Installs Column to Numeric Type

Clean the Installs column and convert it to a numeric type. Some values in Installs have a + modifier. Just remove the symbol and keep the original number (for example, +2,500 or 2,500+ should become the number 2500).

In [None]:
df['Installs'] = df['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(int)  # Remove '+' and ',' before converting
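
If a non-numeric value is still present after the symbols are removed, `astype(int)` raises an error. One more tolerant option, sketched here on made-up values (the `"n/a"` entry is a hypothetical invalid value), is `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Made-up values; "n/a" stands in for a hypothetical invalid entry
installs = pd.Series(["+2,500", "2,500+", "10", "n/a"])

# regex=False treats '+' as a literal character, not a regex quantifier
cleaned = (
    installs.str.replace("+", "", regex=False)
            .str.replace(",", "", regex=False)
)

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric.tolist())
```

The NaN forces a float dtype, so the rows with NaN would still need to be dropped or filled before casting to `int`.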

### 8. Clean and Convert the Size Column to Numeric (Representing Bytes)

The Size column is of type object. Some values contain either a k or an M, indicating kilobytes (1024 bytes) or megabytes (1024 KB). These values should be transformed to their corresponding value in bytes. For example, 898k will become 919552 (898 * 1024).

Some other values are completely invalid (there's no way to infer a number from them). For these, just replace the value with 0.

Some other values carry a + modifier; apply the same rule as in the previous task.

In [None]:
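def convert_size(value):
    # Strip '+' modifiers and thousands separators, as in the Installs task
    text = str(value).replace('+', '').replace(',', '').strip()
    try:
        if text.endswith('M'):
            return int(float(text[:-1]) * 1024 * 1024)  # Convert MB to bytes
        if text.lower().endswith('k'):
            return int(float(text[:-1]) * 1024)  # Convert KB to bytes
        return int(float(text))  # Plain numeric strings are already in bytes
    except (ValueError, OverflowError):
        return 0  # Invalid values such as 'Varies with device'

df['Size'] = df['Size'].apply(convert_size)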
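
The unit arithmetic can be spot-checked in isolation. The following is a table-driven sketch, not the notebook's own function: `UNITS` and `to_bytes` are hypothetical names, and the rule "anything unparseable becomes 0" follows the task description.

```python
UNITS = {"k": 1024, "M": 1024 * 1024}  # bytes per unit, as defined in the task

def to_bytes(text):
    """Hypothetical helper: convert strings such as '898k' or '1.5M' to bytes."""
    text = text.replace("+", "").replace(",", "").strip()
    factor = UNITS.get(text[-1:], 1)        # 1 when there is no known unit suffix
    number = text[:-1] if factor != 1 else text
    try:
        return int(float(number) * factor)
    except ValueError:
        return 0                             # no numeric value can be inferred

print(to_bytes("898k"), to_bytes("1.5M"), to_bytes("Varies with device"))
```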


### 9. Clean and Convert the Price Column to Numeric

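Values in the Price column are strings representing a price with the special symbol '$'. Remove the symbol and convert the values to a numeric type.

In [None]:
# Remove the literal '$' (regex=False, since '$' is a regex anchor) and convert to float
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype(float)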


### 10. Paid or Free?

Now that you have cleaned the Price column, let's create another auxiliary Distribution column.

This column should contain Free/Paid values depending on the app's price.

In [None]:
def classify_app(price):
    if price == 0:
        return 'Free'
    else:
        return 'Paid'

df['Distribution'] = df['Price'].apply(classify_app)  # New auxiliary column with Free/Paid values
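
The comparison `price == 0` only works once Price is numeric; with string prices it is never true, so every app would be labelled Paid. As a hedged alternative to the row-by-row function, `np.where` does the same classification in one vectorized step. The frame `toy` and its prices are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Price": ["0", "$4.99", "$0.99"]})

# Strip the literal '$' and convert to float so comparisons with 0 work
toy["Price"] = toy["Price"].str.replace("$", "", regex=False).astype(float)

# Vectorized equivalent of applying a Free/Paid function to each row
toy["Distribution"] = np.where(toy["Price"] == 0, "Free", "Paid")

print(toy["Distribution"].tolist())
```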


## Finally!!!
- Now all that is left is to save the cleaned dataframe to a new CSV file called `filteredplaystore.csv`.

In [None]:
df.to_csv('filteredplaystore.csv',index = False) # :)