# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/thomaskonstantin/top-850-guitar-tabs?select=gutiarDB.csv

Import the necessary libraries and create your dataframe(s).

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('guitarDB.csv')   #importing all libraries and dataset.

df.head()  # viewing columns and understanding what the data looks like.

|   | Artist | Song Name | Song Rating | Song Hits | Page Type | Difficulty | Key | Capo | Tuning |
|---|--------|-----------|-------------|-----------|-----------|------------|-----|------|--------|
| 0 | Jeff Buckley | Hallelujah | 40045 | 31174526 | Chords | novice | Db | 1st fret | E A D G B E |
| 1 | Ed Sheeran | Perfect | 31694 | 25794778 | Chords | novice | Ab | 1st fret | E A D G B E |
| 2 | John Legend | All Of Me | 20169 | 25653362 | Chords | novice | Fm | 1st fret | E A D G B E |
| 3 | Passenger | Let Her Go | 17267 | 24556593 | Chords | novice | Em | 7th fret | E A D G B E |
| 4 | Led Zeppelin | Stairway To Heaven | 11839 | 20762763 | Tab | intermediate | Am | No Capo | E A D G B E |


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [11]:
# Checking for null/missing data: df.info() lists the non-null count for every column.
df.info(verbose=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Artist       850 non-null    object
 1   Song Name    850 non-null    object
 2   Song Rating  850 non-null    object
 3   Song Hits    850 non-null    object
 4   Page Type    850 non-null    object
 5   Difficulty   850 non-null    object
 6   Key          850 non-null    object
 7   Capo         850 non-null    object
 8   Tuning       850 non-null    object
dtypes: object(9)
memory usage: 59.9+ KB
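
`df.info()` only counts true nulls, so a column can look complete while still holding blank strings or placeholder text. The sketch below shows one way to check for that. The sample rows and the list of placeholder strings are assumptions for illustration, not values confirmed to be in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the real dataframe.
sample = pd.DataFrame({
    "Artist": ["Jeff Buckley", "Ed Sheeran", " ", "Passenger"],
    "Key": ["Db", "N/A", "Fm", "Em"],
})

# Strings assumed (not confirmed) to mean "missing" in this kind of data.
placeholders = ["", "N/A", "n/a", "-", "?", "unknown"]

# Strip surrounding whitespace, then turn placeholder strings into real NaN values.
cleaned = sample.apply(lambda col: col.str.strip()).replace(placeholders, np.nan)

# Count missing values per column after the placeholders are converted.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

If every count is still zero on the real dataframe, the "850 non-null" result above can be trusted as is.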


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [13]:
df['Artist'].describe()
# Running by each column - Artist: describe() shows the most frequent artist (Taylor Swift, 27 songs),
# and that the 850 rows cover 342 distinct artists.

count              850
unique             342
top       Taylor Swift
freq                27
Name: Artist, dtype: object

In [14]:
df['Song Name'].describe()
# Running by each column - Song Name: 752 distinct titles across 850 rows, so some titles repeat.
# Wonderwall is the most repeated title (4 rows). Other titles may repeat fewer than 4 times.
# A repeated title is not necessarily a duplicate row: it could be a different artist or page type.

count             850
unique            752
top       Wonderwall 
freq                4
Name: Song Name, dtype: object

In [22]:
df['Song Rating'].describe()
# Running by each column - Song Rating: the dtype is object, so describe() treats the values as text.
# 'unique' is the number of distinct rating values (751), 'top' is the most common value ('2,473'),
# and 'freq' says that value appears 4 times. It does not identify the top-rated song or a 4-star scale.
# The thousands separators need to be removed and the column converted to numbers before checking for outliers.

count       850
unique      751
top       2,473
freq          4
Name: Song Rating, dtype: object
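
Because the column is stored as text, numeric statistics such as min, max, and mean are not available yet. A minimal sketch of the conversion, assuming the values look like `'2,473'`. The sample values, including the deliberately unparseable one, are hypothetical.

```python
import pandas as pd

# Hypothetical sample values in the same "1,234" text format as the describe() output.
ratings = pd.Series(
    ["40,045", "31,694", "2,473", "2,473", "bad value"], name="Song Rating"
)

# Remove thousands separators, then convert to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
ratings_numeric = pd.to_numeric(
    ratings.str.replace(",", "", regex=False), errors="coerce"
)

# describe() now reports numeric statistics (mean, min, quartiles, max).
print(ratings_numeric.describe())
```

An alternative is to pass `thousands=','` to `pd.read_csv` so the columns are parsed as numbers when the file is loaded.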

In [16]:
df['Song Hits'].describe()
# Running by each column - Song Hits: 849 distinct values across 850 rows, so exactly one value
# ('1,556,006') appears twice. That row pair is a possible duplicate.

count           850
unique          849
top       1,556,006
freq              2
Name: Song Hits, dtype: object
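
Once the hit counts are numeric, outliers can be flagged with the common 1.5 x IQR rule. This is a sketch under assumptions: the sample values are illustrative, and the 1.5 multiplier is a convention rather than something the dataset dictates.

```python
import pandas as pd

# Illustrative hit counts, already converted to integers (hypothetical sample).
hits = pd.Series(
    [31174526, 25794778, 1556006, 1556006, 1200000, 1100000, 950000, 900000],
    name="Song Hits",
)

# Interquartile range: the spread of the middle 50% of values.
q1 = hits.quantile(0.25)
q3 = hits.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = hits[(hits < lower) | (hits > upper)]

print(outliers)
```

For a "top songs" dataset, very large hit counts are probably real values rather than errors, so flagging them and keeping them is likely safer than dropping them.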

In [17]:
df['Page Type'].describe()
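# Running by each column - Page Type: the kind of notation on the page, such as Chords or Tab.
# There are 5 distinct types, and Chords is by far the most common (680 of 850), which suggests
# users prefer chords over tablature.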

count        850
unique         5
top       Chords
freq         680
Name: Page Type, dtype: object

In [18]:
df['Difficulty'].describe()
# Running by each column - Song Difficulty: 6 distinct difficulty values. The most common is
# intermediate, which appears 510 times.

count              850
unique               6
top       intermediate
freq               510
Name: Difficulty, dtype: object

In [19]:
df['Key'].describe()
# Running by each column - Song Key: there are only so many musical keys, so this column should have
# few distinct values. It has 30.
# The most common key is Db (D-flat), which appears 235 times, or about 28% of the total.

count     850
unique     30
top        Db
freq      235
Name: Key, dtype: object

In [23]:
df['Capo'].describe()
# Running by each column - Guitar Capo Device: a capo clamps across the guitar neck and holds down
# all of the strings at one fret, which raises the pitch of the open strings.
# Well over half of the songs (549 of 850) use no capo.
# If an advertiser wanted to offer a "learn to use a capo" lesson, the No Capo group would be the target placement for the ad.
# If an advertiser wanted to offer a new kind of capo, the reverse applies: place the ad with everyone except this group.

count         850
unique         16
top       No Capo
freq          549
Name: Capo, dtype: object

In [21]:
df['Tuning'].describe()
# Running by each column - Guitar Tuning (standard and other): retuning the strings by turning the pegs
# on the guitar makes even more sounds and styles possible. It is no surprise that the most common tuning
# is standard E A D G B E (616 songs), but it is curious that 234 of the 850 songs use a non-standard tuning!
# Hypothesis: those 234 songs skew toward the harder difficulty levels.

count              850
unique              18
top        E A D G B E
freq               616
Name: Tuning, dtype: object
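
The tuning hypothesis can be checked with a cross-tabulation of tuning against difficulty. A minimal sketch with hypothetical rows: the `advanced` label and the non-standard tunings are assumptions, since only `novice` and `intermediate` are visible in the output above.

```python
import pandas as pd

# Hypothetical sample rows; the real dataframe has 850.
sample = pd.DataFrame({
    "Tuning": [
        "E A D G B E", "E A D G B E", "D A D G B E",
        "Eb Ab Db Gb Bb Eb", "E A D G B E",
    ],
    "Difficulty": ["novice", "intermediate", "advanced", "intermediate", "novice"],
})

# Flag each row as standard or non-standard tuning.
sample["Standard Tuning"] = sample["Tuning"].str.strip() == "E A D G B E"

# Share of each difficulty level within each tuning group (each row sums to 1).
table = pd.crosstab(
    sample["Standard Tuning"], sample["Difficulty"], normalize="index"
)
print(table)
```

If the non-standard row shows a clearly larger share of the harder levels than the standard row, the hypothesis has some support.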

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [None]:
# Every column is fully populated (850 non-null values each), so no column is dropped for being empty.
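
Duplicate rows are another kind of unnecessary data, and the repeated song titles and the repeated hit count found earlier suggest some may exist. A sketch of how to look for them, using hypothetical rows. The trailing space in one title is an assumption about how near-duplicates could be hiding.

```python
import pandas as pd

# Hypothetical sample rows, including one title with a trailing space.
sample = pd.DataFrame({
    "Artist": ["Oasis", "Oasis", "Oasis", "Ed Sheeran"],
    "Song Name": ["Wonderwall ", "Wonderwall", "Wonderwall", "Perfect"],
    "Page Type": ["Chords", "Chords", "Tab", "Chords"],
})

# Strip stray whitespace first so near-identical titles compare as equal.
sample["Song Name"] = sample["Song Name"].str.strip()

# Rows that match an earlier row in every column.
exact_dupes = sample.duplicated().sum()

# Every row that shares an artist and title with another row (kept for review).
same_song = sample[sample.duplicated(subset=["Artist", "Song Name"], keep=False)]

# Drop only the exact duplicates; same song with a different page type stays.
deduped = sample.drop_duplicates()

print(exact_dupes)
print(same_song)
```

Reviewing `same_song` before dropping anything helps separate true duplicates from the same song offered as both chords and tab.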


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.
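
One possible approach, sketched with hypothetical values: normalize whitespace and capitalization in the categorical columns, and turn the capo text into a fret number. Treating "No Capo" as fret 0 is an assumption made for this example.

```python
import pandas as pd

# Hypothetical sample values showing the kinds of inconsistency to look for.
sample = pd.DataFrame({
    "Difficulty": ["novice", "Novice ", "intermediate", "INTERMEDIATE"],
    "Capo": ["1st fret", "No Capo", "7th fret", "no capo"],
})

# Same label, different spelling: strip whitespace and lowercase.
sample["Difficulty"] = sample["Difficulty"].str.strip().str.lower()

# Pull the digits out of strings like "7th fret"; rows without digits become NaN.
capo_fret = sample["Capo"].str.extract(r"(\d+)", expand=False)

# Assumption: no digits means no capo, recorded here as fret 0.
sample["Capo Fret"] = pd.to_numeric(capo_fret).fillna(0).astype(int)

print(sample["Difficulty"].value_counts())
print(sample[["Capo", "Capo Fret"]])
```

Running `value_counts()` on each categorical column of the real dataframe before and after this step shows whether any labels were merged.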

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
2. Did the process of cleaning your data give you new insights into your dataset?
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?