**Problem:**

You are given the following dataset:
1. **Audible Data** : https://1drv.ms/u/s!AiqdXCxPTydhoog8ckLN-6Cw55fzIg?e=EWgZ5d

Your task is to:
- Find the problems with the datasets.
- Define the Data Quality Dimensions.
- Try to clean the datasets.

In [53]:
import numpy as np
import pandas as pd

In [54]:
auidable = pd.read_csv('datasets/audible_uncleaned.csv')
auidable

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,Writtenby:GeronimoStilton,Narratedby:BillLobely,2 hrs and 20 mins,04-08-08,English,5 out of 5 stars34 ratings,468.00
1,The Burning Maze,Writtenby:RickRiordan,Narratedby:RobbieDaymond,13 hrs and 8 mins,01-05-18,English,4.5 out of 5 stars41 ratings,820.00
2,The Deep End,Writtenby:JeffKinney,Narratedby:DanRussell,2 hrs and 3 mins,06-11-20,English,4.5 out of 5 stars38 ratings,410.00
3,Daughter of the Deep,Writtenby:RickRiordan,Narratedby:SoneelaNankani,11 hrs and 16 mins,05-10-21,English,4.5 out of 5 stars12 ratings,615.00
4,"The Lightning Thief: Percy Jackson, Book 1",Writtenby:RickRiordan,Narratedby:JesseBernstein,10 hrs,13-01-10,English,4.5 out of 5 stars181 ratings,820.00
...,...,...,...,...,...,...,...,...
87484,Last Days of the Bus Club,Writtenby:ChrisStewart,Narratedby:ChrisStewart,7 hrs and 34 mins,09-03-17,English,Not rated yet,596.00
87485,The Alps,Writtenby:StephenO'Shea,Narratedby:RobertFass,10 hrs and 7 mins,21-02-17,English,Not rated yet,820.00
87486,The Innocents Abroad,Writtenby:MarkTwain,Narratedby:FloGibson,19 hrs and 4 mins,30-12-16,English,Not rated yet,938.00
87487,A Sentimental Journey,Writtenby:LaurenceSterne,Narratedby:AntonLesser,4 hrs and 8 mins,23-02-11,English,Not rated yet,680.00


**Audible (uncleaned)**
1. The `author` column carries a "Writtenby:" prefix, such as "Writtenby:RickRiordan", and the first and last names are run together without a space. This is a validity issue (formatting).

2. The `narrator` column carries a "Narratedby:" prefix, such as "Narratedby:BillLobely", with the same run-together names. This is a validity issue (formatting).

3. The `time` column stores durations as English text, such as "2 hrs and 20 mins", so it has the object dtype and cannot be used in numerical analysis. This is a validity issue (formatting and data type).

4. The `stars` column packs the rating and the number of ratings into a single field, such as "4.5 out of 5 stars41 ratings", or holds the text "Not rated yet". The values are inconsistent and non-numeric. This is a validity issue (formatting and data structure).

5. The `price` column is stored as object rather than float, likely because some entries contain currency symbols or text such as "$9.99" or "Free". This is a validity issue (data type).

6. The `price` column may mix formats: most values are plain numbers such as "468.00", while others potentially include currency symbols or text, so the representation is inconsistent. This is a validity issue (formatting).

7. The `releasedate` column stores dates as text in day-month-year order with a two-digit year, such as "13-01-10", rather than as a datetime type. This is a validity issue (formatting and data type).
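
Items 5 and 6 only guess at why `price` is not numeric. A quick way to confirm the cause is to attempt the conversion and list the values that fail. The sketch below uses a small made-up sample (the values "1,256.00" and "Free" are assumptions for illustration, not rows taken from the dataset); on the real data the same two lines could be run against `auidable['price']`.

```python
import pandas as pd

# Hypothetical sample standing in for the real price column
sample = pd.DataFrame({"price": ["468.00", "1,256.00", "Free", "820.00"]})

# Values that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(sample["price"], errors="coerce")

# Keep the original text of every entry that failed to parse
offenders = sample.loc[as_number.isna() & sample["price"].notna(), "price"]
print(offenders.tolist())
```

Looking at the offending values first makes it clear which cleaning rule (stripping separators, mapping text to a number, and so on) is actually needed.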
 

In [55]:
# Correct the author column: drop the "Writtenby:" prefix
auidable['author'] = auidable['author'].str.replace('Writtenby:', '', regex=False).str.strip()


In [56]:
# Correct the narrator column: drop the "Narratedby:" prefix
auidable['narrator'] = auidable['narrator'].str.replace('Narratedby:', '', regex=False).str.strip()
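
Removing the prefix still leaves names such as "RickRiordan" with no space between the parts. One possible follow-up, shown here as a sketch on a few names from the output, is to insert a space wherever a lowercase letter is directly followed by an uppercase one. This is a heuristic and an assumption about how the names were joined: it would wrongly split surnames like "McDonald" and would not separate initials.

```python
import pandas as pd

names = pd.Series(["RickRiordan", "StephenO'Shea", "GeronimoStilton"])

# Zero-width match between a lowercase letter and the uppercase letter after it
spaced = names.str.replace(r"(?<=[a-z])(?=[A-Z])", " ", regex=True)
print(spaced.tolist())
```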

In [57]:
auidable

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,GeronimoStilton,BillLobely,2 hrs and 20 mins,04-08-08,English,5 out of 5 stars34 ratings,468.00
1,The Burning Maze,RickRiordan,RobbieDaymond,13 hrs and 8 mins,01-05-18,English,4.5 out of 5 stars41 ratings,820.00
2,The Deep End,JeffKinney,DanRussell,2 hrs and 3 mins,06-11-20,English,4.5 out of 5 stars38 ratings,410.00
3,Daughter of the Deep,RickRiordan,SoneelaNankani,11 hrs and 16 mins,05-10-21,English,4.5 out of 5 stars12 ratings,615.00
4,"The Lightning Thief: Percy Jackson, Book 1",RickRiordan,JesseBernstein,10 hrs,13-01-10,English,4.5 out of 5 stars181 ratings,820.00
...,...,...,...,...,...,...,...,...
87484,Last Days of the Bus Club,ChrisStewart,ChrisStewart,7 hrs and 34 mins,09-03-17,English,Not rated yet,596.00
87485,The Alps,StephenO'Shea,RobertFass,10 hrs and 7 mins,21-02-17,English,Not rated yet,820.00
87486,The Innocents Abroad,MarkTwain,FloGibson,19 hrs and 4 mins,30-12-16,English,Not rated yet,938.00
87487,A Sentimental Journey,LaurenceSterne,AntonLesser,4 hrs and 8 mins,23-02-11,English,Not rated yet,680.00


In [58]:
import re


def convert_time_to_minutes(time_val):
    """Convert a duration such as '2 hrs and 20 mins' to total minutes."""
    if pd.isna(time_val):  # Keep missing values missing
        return np.nan

    # Numeric inputs (int or float) are assumed to already be in minutes
    if isinstance(time_val, (int, float)):
        return float(time_val)

    time_str = str(time_val).lower()

    # Hours: the number before 'hr' or 'hrs'
    hr_match = re.search(r'(\d+)\s*hr', time_str)
    # Minutes: the number before 'min' or 'mins'
    min_match = re.search(r'(\d+)\s*min', time_str)

    # Text with neither unit is malformed: return NaN instead of a silent 0
    if hr_match is None and min_match is None:
        return np.nan

    hours = int(hr_match.group(1)) if hr_match else 0
    minutes = int(min_match.group(1)) if min_match else 0
    return hours * 60 + minutes


# Overwrite the text durations with total minutes
auidable['time'] = auidable['time'].apply(convert_time_to_minutes)



In [59]:
# Store minutes as a nullable integer so that missing durations do not raise
auidable['time'] = auidable['time'].astype('Int64')
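
The row-by-row function can also be written with vectorized string methods. The following is a minimal sketch of that alternative, not the method used in this notebook; the sample series is made up from durations shown in the output, plus one missing value.

```python
import pandas as pd

durations = pd.Series(["2 hrs and 20 mins", "10 hrs", "45 mins", None])

# Pull out the number before each unit; rows without that unit become NaN
hours = durations.str.extract(r"(\d+)\s*hr", expand=False).astype(float)
minutes = durations.str.extract(r"(\d+)\s*min", expand=False).astype(float)

# Treat a missing unit as 0, but keep rows with neither unit as missing
total = hours.fillna(0) * 60 + minutes.fillna(0)
total = total.where(hours.notna() | minutes.notna())

# Nullable integer dtype keeps whole minutes and still allows missing values
total_minutes = total.astype("Int64")
print(total_minutes.tolist())
```

On a column of this size the vectorized form is usually faster than `apply`, and the nullable `Int64` dtype avoids the error that `astype(int)` raises when any value is missing.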

In [65]:
# Parse the text dates; the source format is day-month-year with a two-digit year
auidable['releasedate'] = pd.to_datetime(auidable['releasedate'], format='%d-%m-%y')


In [70]:
# Mark unrated titles with 0; rated rows still hold the original text
auidable['stars'] = auidable['stars'].replace("Not rated yet", 0)
auidable

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price
0,Geronimo Stilton #11 & #12,GeronimoStilton,BillLobely,140,2008-08-04,English,5 out of 5 stars34 ratings,468.00
1,The Burning Maze,RickRiordan,RobbieDaymond,788,2018-05-01,English,4.5 out of 5 stars41 ratings,820.00
2,The Deep End,JeffKinney,DanRussell,123,2020-11-06,English,4.5 out of 5 stars38 ratings,410.00
3,Daughter of the Deep,RickRiordan,SoneelaNankani,676,2021-10-05,English,4.5 out of 5 stars12 ratings,615.00
4,"The Lightning Thief: Percy Jackson, Book 1",RickRiordan,JesseBernstein,600,2010-01-13,English,4.5 out of 5 stars181 ratings,820.00
...,...,...,...,...,...,...,...,...
87484,Last Days of the Bus Club,ChrisStewart,ChrisStewart,454,2017-03-09,English,0,596.00
87485,The Alps,StephenO'Shea,RobertFass,607,2017-02-21,English,0,820.00
87486,The Innocents Abroad,MarkTwain,FloGibson,1144,2016-12-30,English,0,938.00
87487,A Sentimental Journey,LaurenceSterne,AntonLesser,248,2011-02-23,English,0,680.00
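
The `stars` column still mixes text and numbers at this point. A possible next step, sketched below on values copied from the output, is to split it into a numeric rating and a rating count. The column names `rating` and `n_ratings` are hypothetical, and the pattern assumes every rated row follows the "X out of 5 starsN ratings" shape seen above. Unrated rows are left as missing here rather than 0, so they are not confused with a genuine zero-star score.

```python
import pandas as pd

stars = pd.Series([
    "5 out of 5 stars34 ratings",
    "4.5 out of 5 stars41 ratings",
    "Not rated yet",
])

# Named groups become the column names of the extracted DataFrame
pattern = r"(?P<rating>[\d.]+) out of 5 stars\s*(?P<n_ratings>[\d,]+) ratings?"
parts = stars.str.extract(pattern)

# Convert both pieces to numbers; rows that did not match stay missing
parts["rating"] = pd.to_numeric(parts["rating"])
counts = parts["n_ratings"].str.replace(",", "", regex=False)
parts["n_ratings"] = pd.to_numeric(counts).astype("Int64")
print(parts)
```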
