# DATA COLLECTION
Data collection is the step in which we identify and gather the data used to train our machine learning model. The collected data should be clear, accurate, and free of bias, because any error in the data carries over into every subsequent stage of the model that is trained on it.

In [1]:
# pandas is a Python library that provides tools for analyzing and manipulating data.
import pandas as pd

**A Series represents a one-dimensional ndarray with axis labels**

*`pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)`*

**data** = The dataset. It can be an array-like, an iterable, a dictionary, or a scalar value

**index** = An array-like or 1D Index that replaces the default index labels (0, 1, 2, 3...) with user-supplied labels (e.g., 1A, 2B, 3C...)

**dtype** = The data type of the resulting Series. It can be given as a string or as a NumPy data type (numpy.dtype). If omitted, it is inferred from the data

**name** = The name given to the Series

**copy** = Whether to copy the input data. It is a boolean and defaults to False

In [2]:
artists = ["Eminem", "JAY-Z", "Mariah Carey", "P!nk", "Backstreet Boys"]    # 1D array dataset
artist_series = pd.Series(artists)  # Represent the artist dataset using the Series function
print(artist_series)
# print(type(artist_series))    // <class 'pandas.core.series.Series'>

0             Eminem
1              JAY-Z
2       Mariah Carey
3               P!nk
4    Backstreet Boys
dtype: object
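
The cell above only passes `data`. As a minimal sketch of the other parameters described earlier, the example below reuses the same artist list and adds a custom `index` and a `name`. The labels `1A`...`5E` and the small `numbers` series are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

artists = ["Eminem", "JAY-Z", "Mariah Carey", "P!nk", "Backstreet Boys"]
labels = ["1A", "2B", "3C", "4D", "5E"]  # illustrative custom index labels

# index replaces the default 0, 1, 2... labels; name labels the Series itself
artist_series = pd.Series(data=artists, index=labels, name="artist")
print(artist_series)

# Values can now be looked up by their custom label
print(artist_series["3C"])

# dtype forces the element type: these integers are stored as 64-bit floats
numbers = pd.Series([1, 2, 3], dtype="float64")
print(numbers.dtype)
```

Passing `index` does not reorder or filter a plain list; it only relabels the positions, so the list and the labels must have the same length.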


**A DataFrame represents a two-dimensional dataset in tabular form**

*`pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)`*

**data** = The dataset. It can be an ndarray, a dictionary, or another DataFrame

**index** = An array-like or 1D Index that replaces the default row labels (0, 1, 2, 3...) with user-supplied labels (e.g., 1A, 2B, 3C...)

**columns** = The column labels. By default they are 0, 1, 2, 3... A list or array can be passed to use custom column labels

**dtype** = The data type to force on the DataFrame. Only a single dtype is allowed; if omitted, it is inferred from the data

**copy** = Whether to copy the input data. It is a boolean or None, and defaults to None

In [10]:
# We provide the data as a dictionary whose keys become the column names of the DataFrame
# Each value is a list holding that column's entries, one per row
artists = {"ARTIST": ["Britney Spears", "blink-182", "Faith Hill", "Bon Jovi", "*NSYNC"],
           "SONG": ["Oops!...I Did It Again", "All The Small Things", "Breathe", "It's My Life", "Bye Bye Bye"],
           "DURATION": [211160, 167066, 250546, 224493, 200560],
           "EXPLICIT":["FALSE", "FALSE", "FALSE", "FALSE", "FALSE"],
           "YEAR": [2000, 1999, 1999, 2000, 2000],
           "POPULARITY": [77, 79, 66, 78, 65]}
artist_dataframe = pd.DataFrame(artists)
print(artist_dataframe)
# print(type(artist_dataframe))     // <class 'pandas.core.frame.DataFrame'>

           ARTIST                    SONG  DURATION EXPLICIT  YEAR  POPULARITY
0  Britney Spears  Oops!...I Did It Again    211160    FALSE  2000          77
1       blink-182    All The Small Things    167066    FALSE  1999          79
2      Faith Hill                 Breathe    250546    FALSE  1999          66
3        Bon Jovi            It's My Life    224493    FALSE  2000          78
4          *NSYNC             Bye Bye Bye    200560    FALSE  2000          65
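
The parameters `index`, `columns`, and `dtype` described earlier are not used in the cell above. The sketch below shows one way they might be applied to a three-row subset of the same data. The row labels `1A`, `2B`, `3C` and the variable names `indexed`, `subset`, and `numeric` are assumptions made for illustration.

```python
import pandas as pd

artists = {"ARTIST": ["Britney Spears", "blink-182", "Faith Hill"],
           "YEAR": [2000, 1999, 1999],
           "POPULARITY": [77, 79, 66]}

# index replaces the default 0, 1, 2... row labels
indexed = pd.DataFrame(artists, index=["1A", "2B", "3C"])
print(indexed)
print(indexed.loc["2B", "ARTIST"])  # look up a cell by row label and column name

# With dictionary input, columns selects and orders the keys to keep
subset = pd.DataFrame(artists, columns=["POPULARITY", "ARTIST"])
print(subset)

# dtype forces a single type on every column, so it suits all-numeric data
numeric = pd.DataFrame({"YEAR": [2000, 1999], "POPULARITY": [77, 79]}, dtype="float64")
print(numeric.dtypes)
```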


In [11]:
# We now build the same DataFrame as above, this time from a list of lists instead of a dictionary.
# Each inner list is one row and holds the same values that appeared in the dictionary
# The dictionary keys, which served as column names before, are now passed through the columns parameter
labels = ["ARTIST", "SONG", "DURATION", "EXPLICIT", "YEAR", "POPULARITY"]  # The column names
artists = [["Britney Spears", "Oops!...I Did It Again", 211160, "FALSE", 2000, 77],
           ["blink-182", "All The Small Things", 167066, "FALSE", 1999, 79],
           ["Faith Hill", "Breathe", 250546, "FALSE", 1999, 66],
           ["Bon Jovi", "It's My Life", 224493, "FALSE", 2000, 78],
           ["*NSYNC", "Bye Bye Bye", 200560, "FALSE", 2000, 65]]  # Provided the variables
artist_dataframe_column = pd.DataFrame(artists, columns=labels)
print(artist_dataframe_column)

           ARTIST                    SONG  DURATION EXPLICIT  YEAR  POPULARITY
0  Britney Spears  Oops!...I Did It Again    211160    FALSE  2000          77
1       blink-182    All The Small Things    167066    FALSE  1999          79
2      Faith Hill                 Breathe    250546    FALSE  1999          66
3        Bon Jovi            It's My Life    224493    FALSE  2000          78
4          *NSYNC             Bye Bye Bye    200560    FALSE  2000          65
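
The two printed tables look identical, but this can also be checked in code. The sketch below rebuilds a two-row version of the data both ways and compares the results with `DataFrame.equals`, which compares shape, values, and column dtypes. The variable names are illustrative.

```python
import pandas as pd

labels = ["ARTIST", "YEAR", "POPULARITY"]

# Column-oriented input: one key per column
as_dict = {"ARTIST": ["Britney Spears", "blink-182"],
           "YEAR": [2000, 1999],
           "POPULARITY": [77, 79]}

# Row-oriented input: one inner list per row
as_rows = [["Britney Spears", 2000, 77],
           ["blink-182", 1999, 79]]

from_dict = pd.DataFrame(as_dict)
from_rows = pd.DataFrame(as_rows, columns=labels)

# True when both frames have the same shape, values, and dtypes
print(from_dict.equals(from_rows))
```

The dictionary form is convenient when the data already arrives column by column, while the list-of-rows form matches data that arrives one record at a time.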


In [22]:
# The read_csv function reads the data from a CSV file into a DataFrame
artist_csv = pd.read_csv("data/songs_normalize.csv")
print(artist_csv)

              artist                                    song  duration_ms  \
0     Britney Spears                  Oops!...I Did It Again       211160   
1          blink-182                    All The Small Things       167066   
2         Faith Hill                                 Breathe       250546   
3           Bon Jovi                            It's My Life       224493   
4             *NSYNC                             Bye Bye Bye       200560   
...              ...                                     ...          ...   
1995  Jonas Brothers                                  Sucker       181026   
1996    Taylor Swift                            Cruel Summer       178426   
1997    Blanco Brown                              The Git Up       200593   
1998       Sam Smith  Dancing With A Stranger (with Normani)       171029   
1999     Post Malone                                 Circles       215280   

      explicit  year  popularity  danceability  energy  key  loudness  mode
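
Printing a 2000-row DataFrame truncates the output, so it is often easier to inspect the result with `shape`, `dtypes`, and `head`. The sketch below is a self-contained stand-in: instead of `data/songs_normalize.csv`, it reads a small in-memory CSV. The three rows and the five-column subset are assumptions, with values taken from the earlier cells.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file (assumed subset of columns and rows)
csv_text = """artist,song,duration_ms,year,popularity
Britney Spears,Oops!...I Did It Again,211160,2000,77
blink-182,All The Small Things,167066,1999,79
Faith Hill,Breathe,250546,1999,66
"""

# read_csv accepts a file path or any file-like object
songs = pd.read_csv(io.StringIO(csv_text))

print(songs.shape)    # (number of rows, number of columns)
print(songs.dtypes)   # inferred type of each column
print(songs.head(2))  # first two rows only

# usecols restricts parsing to the listed columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "duration_ms"])
print(subset)
```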

In [23]:
# To read data from an Excel spreadsheet, we use the read_excel function
artist_excel = pd.read_excel("data/songs_normalize.xlsx")
print(artist_excel)

# To read a specific sheet from a workbook, pass its name through the "sheet_name" parameter

# artist_excel_second = pd.read_excel("data/songs_normalize.xlsx", sheet_name="Second sheet")
# print(artist_excel_second)

              artist                                    song  duration_ms  \
0     Britney Spears                  Oops!...I Did It Again       211160   
1          blink-182                    All The Small Things       167066   
2         Faith Hill                                 Breathe       250546   
3           Bon Jovi                            It's My Life       224493   
4             *NSYNC                             Bye Bye Bye       200560   
...              ...                                     ...          ...   
1995  Jonas Brothers                                  Sucker       181026   
1996    Taylor Swift                            Cruel Summer       178426   
1997    Blanco Brown                              The Git Up       200593   
1998       Sam Smith  Dancing With A Stranger (with Normani)       171029   
1999     Post Malone                                 Circles       215280   

      explicit  year  popularity  danceability  energy  key  loudness  mode