**Problem:**

You are given the following dataset:
1. **Audible Data** : https://1drv.ms/u/s!AiqdXCxPTydhoog8ckLN-6Cw55fzIg?e=EWgZ5d

Your task is to:
- Find the problems with the datasets.
- Define the Data Quality Dimensions.
- Try to clean the datasets.

In [65]:
import pandas as pd
import numpy as np
df = pd.read_csv('audible_cleaned.csv')
df

Unnamed: 0,name,author,narrator,time,releasedate,language,stars,price,ratings
0,Geronimo Stilton #11 & #12,GeronimoStilton,BillLobely,140,2008-04-08,English,5.0,468.0,34.0
1,The Burning Maze,RickRiordan,RobbieDaymond,788,2018-01-05,English,4.5,820.0,41.0
2,The Deep End,JeffKinney,DanRussell,123,2020-06-11,English,4.5,410.0,38.0
3,Daughter of the Deep,RickRiordan,SoneelaNankani,676,2021-05-10,English,4.5,615.0,12.0
4,"The Lightning Thief: Percy Jackson, Book 1",RickRiordan,JesseBernstein,600,2010-01-13,English,4.5,820.0,181.0
...,...,...,...,...,...,...,...,...,...
87484,Last Days of the Bus Club,ChrisStewart,ChrisStewart,454,2017-09-03,English,0.0,596.0,0.0
87485,The Alps,StephenO'Shea,RobertFass,607,2017-02-21,English,0.0,820.0,0.0
87486,The Innocents Abroad,MarkTwain,FloGibson,1144,2016-12-30,English,0.0,938.0,0.0
87487,A Sentimental Journey,LaurenceSterne,AntonLesser,248,2011-02-23,English,0.0,680.0,0.0


## Problems with the dataset:
- `name`:
    - Some titles include the book or volume number in inconsistent formats.
        - [x] With a hash sign, e.g., "Geronimo Stilton #11 & #12".
        - [x] "Magic Tree House Collection: Books 9-16".
        - [x] "The 39 Clues, Book 6"
    - Mis-decoded character sequences (mojibake) such as â€™, Ã¤, Ã¼
        - [x] Some rows (9, 34, 157, 162, 170...) contain these garbled sequences.


- `author` & `narrator`:
    - [x] Every value starts with a prefix: "Writtenby:" in `author`, e.g., "Writtenby:GeronimoStilton", and "Narratedby:" in `narrator`.
    - [x] Some values hold two or more authors, e.g., "Writtenby:JuliaDonaldson,AxelScheffler"
    - [x] Some values also contain the garbled characters, e.g., "Writtenby:FranciscoDÃ­azValladares"
    - [x] First and last names are not separated by a space, e.g., "Writtenby:NicolasGorny".
    - [x] Some values include extra information, such as a role suffix or a trailing comma, e.g., "Writtenby:AndrewPeterson-editor,JonathanRogers,N.D.Wilson,"
    - [x] Some `narrator` values are not actual names, e.g., "Narratedby:uncredited".


- `time`:
    - [x] The values are strings that combine hours and minutes, e.g., "2 hrs and 20 mins", "10 hrs", "22 mins"


- `releasedate`:
    - [x] Dates appear in two formats: "08-04-2008" (four-digit year) and "13-01-10" (two-digit year).
    - [x] The dtype is object; converting the column to datetime resolves the inconsistency.


- `language`
    - [x] Some values are in title case and some are in lower case, e.g., "English" and "german".


- `stars`
    - [x] The total number of ratings is stored in the same string as the average stars, e.g., "5 out of 5 stars34 ratings"
    - [x] The highest rating is 5 and the lowest is 1. The average star rating is written out in long form rather than as a number.
    - [x] Missing values are represented as "Not rated yet". For these, we can assume total ratings = 0 and average stars = 0.


- `price`
    - [x] Some entries hold the string "Free", which also forces the column's dtype to object.
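
Several of the string problems listed above can be handled with small regular-expression helpers. The following is only a sketch of one possible approach, not the code that produced `audible_cleaned.csv`: the function names `time_to_minutes`, `split_stars`, and `clean_people` are hypothetical, the handling of a thousands separator in the ratings count is an assumption, and the camel-case split is a heuristic that leaves values such as "N.D.Wilson" untouched.

```python
import re

def time_to_minutes(text):
    """Convert a string such as '2 hrs and 20 mins' to total minutes."""
    hours = re.search(r"(\d+)\s*hrs?", text)
    minutes = re.search(r"(\d+)\s*mins?", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

def split_stars(text):
    """Split '5 out of 5 stars34 ratings' into (average stars, rating count)."""
    match = re.match(r"([\d.]+) out of 5 stars([\d,]+) ratings?", text)
    if match is None:
        # Covers 'Not rated yet': treat as zero stars and zero ratings.
        return 0.0, 0
    # Assumption: large counts may contain a thousands separator.
    return float(match.group(1)), int(match.group(2).replace(",", ""))

def clean_people(text):
    """Strip the 'Writtenby:'/'Narratedby:' prefix and split the names."""
    text = re.sub(r"^(Writtenby|Narratedby):", "", text)
    # Drop empty pieces left behind by a trailing comma.
    names = [name for name in text.split(",") if name]
    # Heuristic: insert a space wherever a lower-case letter meets an upper-case one.
    return [re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name) for name in names]

print(time_to_minutes("2 hrs and 20 mins"))
print(split_stars("5 out of 5 stars34 ratings"))
print(clean_people("Writtenby:JuliaDonaldson,AxelScheffler"))
```

Each helper could then be applied column-wise, for example with `df['time'].apply(time_to_minutes)` on the raw file.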

In [66]:
import numpy as np
import pandas as pd

import re

In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         87489 non-null  object 
 1   author       87489 non-null  object 
 2   narrator     87489 non-null  object 
 3   time         87489 non-null  int64  
 4   releasedate  87489 non-null  object 
 5   language     87489 non-null  object 
 6   stars        87489 non-null  float64
 7   price        87489 non-null  float64
 8   ratings      87489 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.0+ MB


In [68]:
# convert the "releasedate" col to datetime object
df['releasedate'] = pd.to_datetime(df['releasedate'])
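
The call above works on the cleaned file, whose dates are already in ISO format. For the raw file, which mixes "08-04-2008" and "13-01-10", one option is to parse each format explicitly and combine the results. This is a minimal sketch that assumes both formats are day-first, which is consistent with the cleaned values 2008-04-08 and 2010-01-13 shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Sample values taken from the problem list; both are assumed to be day-first.
raw = pd.Series(["08-04-2008", "13-01-10"])

# Parse with a four-digit year; strings that do not match become NaT.
four_digit = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")
# Parse with a two-digit year; strings that do not match become NaT.
two_digit = pd.to_datetime(raw, format="%d-%m-%y", errors="coerce")

# Prefer the four-digit result and fall back to the two-digit one.
parsed = four_digit.fillna(two_digit)
print(parsed)
```

Stating the format explicitly avoids the silent day/month swap that can happen when pandas has to guess the layout of an ambiguous date.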

In [69]:
# Replace the string "Free" with 0 in the price column, then cast to float
# explicitly; replace() alone does not guarantee a numeric dtype.
df['price'] = df['price'].replace('Free', 0).astype(float)
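
On the raw file the price column holds strings, so replacing "Free" still needs a numeric conversion afterwards. The sketch below uses made-up sample values; the thousands separator in the last one is an assumption about the raw data, not something confirmed by the output above.

```python
import pandas as pd

# Illustrative raw values (assumed, not copied from the dataset).
raw_price = pd.Series(["468.00", "Free", "1,256.00"])

cleaned = (
    raw_price.replace("Free", "0")          # treat free titles as price 0
    .str.replace(",", "", regex=False)      # drop thousands separators, if any
)

# Convert the remaining strings to float64.
price = pd.to_numeric(cleaned)
print(price.dtype)
```

`pd.to_numeric` raises on any value it cannot parse, which makes leftover non-numeric entries easy to spot.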

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   name         87489 non-null  object        
 1   author       87489 non-null  object        
 2   narrator     87489 non-null  object        
 3   time         87489 non-null  int64         
 4   releasedate  87489 non-null  datetime64[ns]
 5   language     87489 non-null  object        
 6   stars        87489 non-null  float64       
 7   price        87489 non-null  float64       
 8   ratings      87489 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 6.0+ MB


In [75]:
# convert the "language" col to titlecase
df["language"] = df["language"].str.title()
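
A quick way to confirm that the case normalization merged the variants is to compare the distinct values before and after. This is a small sketch on sample values ("English" and "german" come from the problem list; the lower-case "english" is added for illustration).

```python
import pandas as pd

language = pd.Series(["English", "german", "english"])

# Title-case every value so that case variants collapse into one label.
normalized = language.str.title()

print(language.nunique(), "distinct values before")
print(normalized.nunique(), "distinct values after")
print(normalized.value_counts())
```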