# Musical instrument prices
### A study of the prices of musical instruments in Sri Lanka

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("raw-data.csv", encoding="utf-8")

## Cleaning the dataset

Let's take a look at the data we imported from the CSV file.

In [3]:
df.head(3)

Unnamed: 0,Title,Sub_title,Price,Instrument_Type,Condition,Location,Description,Post_URL,Seller_name,Seller_type,published_date
0,Yamaha (SY-77) Music Synthesizer for sale,"Posted on 04 Oct 7:11 pm, Ja-Ela, Gampaha","Rs 39,000",Keyboard / Piano,Used,"Ja-Ela, Gampaha",Â°â¢Â°Sri Lanka's Largest Digital Piano Selle...,https://ikman.lk/en/ad/yamaha-sy-77-music-synt...,Seven Star International,Member,2021-10-04 19:11:00
1,SRX-718 BASS BIN (PAIR) for sale,"Posted on 10 Oct 7:54 pm, Kadawatha, Gampaha","Rs 77,500",Studio / Live Music Equipment,New,"Kadawatha, Gampaha",â¡Watts 3200â¡â¡Treated Plywoodâ¡,https://ikman.lk/en/ad/srx-718-bass-bin-pair-f...,Sasiru Super Sonics,Member,2021-10-10 19:54:00
2,Piano (Malcom Mendis Piano) for sale,"Posted on 13 Oct 12:43 pm, Kandana, Gampaha","Rs 130,000",Keyboard / Piano,Used,"Kandana, Gampaha","Sri Lanka's Biggest Piano Sale, Reasonable pri...",https://ikman.lk/en/ad/piano-malcom-mendis-pia...,Sell Fast | à¶à¶³à·à¶± | MCI Ikman à¶¯à·à¶±...,Member,2021-10-13 12:43:00


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5167 entries, 0 to 5166
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Title            5167 non-null   object
 1   Sub_title        5167 non-null   object
 2   Price            5167 non-null   object
 3   Instrument_Type  5167 non-null   object
 4   Condition        5167 non-null   object
 5   Location         5167 non-null   object
 6   Description      5167 non-null   object
 7   Post_URL         5167 non-null   object
 8   Seller_name      5167 non-null   object
 9   Seller_type      5167 non-null   object
 10  published_date   5167 non-null   object
dtypes: object(11)
memory usage: 444.2+ KB


We can see that there are no missing values in this dataset. However, all columns have the "object" dtype, even though some of them hold numbers or dates. Two variables, "Condition" and "Seller_type", also seem to be binary. Let's check that before moving on.

In [5]:
print(df.Condition.unique())
print(df.Seller_type.unique())

['Used' 'New']
['Member' 'Premium-Member']


As suspected, both variables are binary. We will replace them with a new pair of variables that take 1 and 0 as their possible values.

**Binary variables**

In [6]:
# Convert the condition and the seller type to binary variables with 1s and 0s

def textToBoolean(condition, yesval, noval):
    # Return 1 for yesval, 0 for noval and None for any other value
    if condition == yesval:
        return 1
    elif condition == noval:
        return 0
    else:
        return None

df["Is_new"] = df["Condition"].apply(lambda x: textToBoolean(x, "New", "Used"))
# Premium_seller is 1 for "Premium-Member" and 0 for "Member"
df["Premium_seller"] = df["Seller_type"].apply(lambda x: textToBoolean(x, "Premium-Member", "Member"))
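
The same conversion can also be written without a helper function. The following is a minimal sketch, not the notebook's own method, that uses `Series.map` with a dictionary. The small `demo` frame is made up for illustration; it only assumes the two values per column that were printed above. Any value missing from the dictionary becomes `NaN`, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook would use df instead.
demo = pd.DataFrame({
    "Condition": ["Used", "New", "Used"],
    "Seller_type": ["Member", "Premium-Member", "Member"],
})

# Map each label to 1 or 0; labels not in the dictionary become NaN.
demo["Is_new"] = demo["Condition"].map({"New": 1, "Used": 0})
demo["Premium_seller"] = demo["Seller_type"].map({"Premium-Member": 1, "Member": 0})

print(demo[["Is_new", "Premium_seller"]])
```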

**Prices**

In [7]:
# We want to convert prices to numbers:

def parsePrice(text):
    text = text.replace("Rs ","")
    text = text.replace(",","")
    return int(text)

df["Price_value"] = df["Price"].apply(lambda x: parsePrice(x))
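
`parsePrice` raises a `ValueError` as soon as one entry does not follow the "Rs 39,000" pattern. In this dataset every price parsed cleanly, but a more defensive variant may be useful on a fresh scrape. The sketch below is an assumption about how one could handle that; `parse_price_safe` is a hypothetical name, and the "Negotiable" input is an invented example of a malformed value, not something taken from the data.

```python
def parse_price_safe(text):
    """Return the price as an int, or None when the text is not a plain amount."""
    cleaned = str(text).replace("Rs", "").replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        # Anything that is not a whole number after cleaning is treated as missing.
        return None

print(parse_price_safe("Rs 39,000"))   # a price in the format seen in the dataset
print(parse_price_safe("Negotiable"))  # invented malformed value
```

In the notebook this could be applied with `df["Price"].apply(parse_price_safe)`, after which the missing entries can be counted before deciding what to do with them.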

**Dates**

In [8]:
# Now, let's convert the date strings in "published_date" to datetime objects

df["Published"] = pd.to_datetime(df["published_date"], format="%Y-%m-%d %H:%M:%S")


# Let's check that the variables have been created correctly:
print("Is_new values:", df.Is_new.unique())
print("Premium_seller values:", df.Premium_seller.unique())


Is_new values: [0 1]
Premium_seller values: [0 1]


**Title**

We can see that the text fields are filled with strange characters. If we take a look at the "Description" column, we will see some weird characters mixed in with the text.

In [9]:
# Let's see an example:
badtext = df.iloc[0].Description
badtext

"Â°â\x80¢Â°Sri Lanka's Largest Digital Piano SellerÂ°â\x80¢Â° Â°â\x80¢Â° Direct Imported Â°â\x80¢Â° Fully Functional and ready to Use Â°â\x80¢Â° Cosmetics : 10/10Â°â\x80¢Â° Ideal for an Hotelier or For an keen learner.Â°â\x80¢Â° 6 months of  WarrantyÂ°â\x80¢Â° Furnished to the OptimumÂ°â\x80¢Â° At Brand New Conditionâ\x80¢Â°â\x80¢ The Art of Honour Lasting Values Â® â\x80¢Â°â\x80¢"

In [10]:
# We are going to make a list of the characters we want to remove and then
# we will create a function that will replace those characters with an empty string

badchars = ["Â","\x80¢","°","â","®","¡","à", "¶","±", "ð"]

def cleanText(text, badchar_list):
    newtext = text
    for char in badchar_list:
        newtext = newtext.replace(char,"")
    return newtext

# In this example we see many of the characters disappearing, but most of
# the description entries are full of added substrings with seemingly random
# patterns, so it is difficult to clean them all with a simple script.

goodtext = cleanText(badtext, badchars)
goodtext

"Sri Lanka's Largest Digital Piano Seller  Direct Imported  Fully Functional and ready to Use  Cosmetics : 10/10 Ideal for an Hotelier or For an keen learner. 6 months of  Warranty Furnished to the Optimum At Brand New Condition The Art of Honour Lasting Values  "
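
The pattern in the example above (`Â°`, `â\x80¢`, `Â®`) is what UTF-8 text looks like after it has been decoded as Latin-1. Assuming that is how this text was damaged, an alternative to deleting characters is to reverse the mistake: encode the string back to Latin-1 bytes and decode those bytes as UTF-8. The sketch below is only an illustration of that idea; `fix_mojibake` is a hypothetical helper, and it leaves a string unchanged whenever the round trip fails, which will happen for entries where some bytes were lost along the way.

```python
def fix_mojibake(text):
    """Try to undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not damaged in this particular way, or is too damaged.
        return text

# First part of the description shown above
sample = "Â°â\x80¢Â°Sri Lanka's Largest Digital Piano Seller"
print(fix_mojibake(sample))  # → °•°Sri Lanka's Largest Digital Piano Seller
```

The repaired text still contains decorative symbols such as `°` and `•`, so a character filter like `cleanText` may still be wanted afterwards.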

In [11]:
# We apply the changes to the dataframe
df["Description"] = df["Description"].apply(lambda x: cleanText(x, badchars))
df["Title"] = df["Title"].apply(lambda x: cleanText(x, badchars))

Even though this cleans some of the fields, many remain plainly unusable. Most of the descriptions are filled with strange characters, but only a few titles are completely unusable. We can see that this affects only about 70 entries:

In [12]:
# Sort by title in descending order so that the unusable titles come first,
# then look at the rows around position 70
df = df.sort_values(["Title"], ascending=False)
df[["Title", "Description"]].iloc[60:75]

Unnamed: 0,Title,Description
2417,··½ »§· ·§ ·» for sale,··½ »§···· ·¸ » ½¯ ·§ ·».... ··...
3158,··· ·§ ·» for sale,"§ ···º ···,···¹,¸·½·½ ,··,..."
4548,½·­·· ·· ½¯ MP3 TRACKS for sale,· ·­ ·º ···¯·····¯º ·¯·· 201...
2490,½·­·¸ ·§·»· for sale,"½·­· ···· ·§·»·,··©·,·º½··,..."
150,½· 15 ··½· §·´· for sale,··¯§¸ ­·º··½· 15 ··½· §·´· ¯·...
4532,½· 15 §·´· 2 · for sale,··´·»·º§¸ ­.½· 15 ·½· 2 ·
4992,FERNANDES 4 String Bass Guitar for sale,··¯§¸ ­·º···. normal volume control 3·...
3140,yamaha wireless mic for sale,Brand newWireless mic
136,yamaha stage custom for sale,Import yamaha stage custom sell pack Good soun...
3870,yamaha speakers for sale,Brand new condition


In [13]:
# The first 70 rows cannot be used, so we drop them
df.drop(df.index[:70], inplace=True)
df[["Title", "Description"]].iloc[:10]

Unnamed: 0,Title,Description
2486,yamaha psr2000 for sale,all keyboards are not in working condition.goo...
2072,yamaha psr e 463 for sale,Brand new conditionUsb midi track playAudio re...
3540,yamaha psr e 403 for sale,"··¯¸ ­­··º· ­Manual, software cd ­"
1230,yamaha piano for sale,Yamaha piano for saleGood condition Call for m...
2866,yamaha pasifica 112j for sale,Giurat eke middel pic up eka weda na
1884,yamaha mixer 6 chanel for sale,yamaha mixer 6 chanelorginel japan use japan f...
1051,yamaha hs 8 for sale,yamaha hs 8 from USA Brand new condition 100%
3982,yamaha double top speakers for sale,Super quality soundsGood low Perfect for outdo...
2144,yamaha csr225 for sale,No errorsGood condition full set
2024,yamaha bbn5 japan bass Guitar for sale,yamaha bbn5 japan bass Guitarfrom japan
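
Dropping the first 70 rows depends on a cutoff that was found by eye after sorting. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to score each title by the share of its characters that are ASCII letters or digits and to filter on that score. The helper name `readable_share`, the handling of the " for sale" suffix and the 0.5 threshold are all hypothetical choices for illustration.

```python
def readable_share(title, suffix=" for sale"):
    """Share of non-space characters that are ASCII letters or digits."""
    # Ignore the suffix that every ad title carries, so it does not inflate the score.
    core = title[:-len(suffix)] if title.endswith(suffix) else title
    chars = [c for c in core if not c.isspace()]
    if not chars:
        return 0.0
    readable = sum(c.isascii() and c.isalnum() for c in chars)
    return readable / len(chars)

titles = [
    "yamaha piano for sale",
    "··½ »§· ·§ ·» for sale",
    "FERNANDES 4 String Bass Guitar for sale",
]

# Keep only titles whose score reaches the (assumed) threshold.
usable = [t for t in titles if readable_share(t) >= 0.5]
print(usable)
```

In the notebook the same function could build a boolean mask with `df["Title"].apply(readable_share) >= 0.5`, which avoids dropping readable rows that happen to sort near the cutoff.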


In [14]:
# We take a final look at the dataset
new_cols = ["Is_new", "Price_value", "Premium_seller","Published", "Description"]
df[new_cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5097 entries, 2486 to 2808
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Is_new          5097 non-null   int64         
 1   Price_value     5097 non-null   int64         
 2   Premium_seller  5097 non-null   int64         
 3   Published       5097 non-null   datetime64[ns]
 4   Description     5097 non-null   object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 238.9+ KB


Everything looks fine, so we can now proceed to analyze the dataset

In [15]:
# Save the processed data
# df.to_csv("processed-data.csv")
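
One thing to keep in mind when saving to CSV is that the format stores no dtypes, so a datetime column comes back as plain text unless it is parsed again on load. The following sketch shows a round trip through an in-memory buffer. The `demo` frame is a made-up stand-in for the processed data that reuses two timestamps from the table above, and writing with `index=False` is a suggestion, not what the notebook does.

```python
import io
import pandas as pd

# Hypothetical stand-in for the processed dataframe.
demo = pd.DataFrame({
    "Price_value": [39000, 77500],
    "Published": pd.to_datetime(["2021-10-04 19:11:00", "2021-10-10 19:54:00"]),
})

# Write without the index so no extra unnamed column appears on reload.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot store.
reloaded = pd.read_csv(buffer, parse_dates=["Published"])
print(reloaded.dtypes)
```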