# **Second section**


*   Perform a data quality analysis on *dataset.csv*.
*   Generate a report of the results.

### Note: find all data quality anomalies in the dataset. Do not correct the anomalies you find; only identify them and justify them in the report.
Deliverable: a data quality report for the dataset. Optionally, include the code that shows how the data quality anomalies were found.

In [1]:
## Importing libraries

import pandas as pd
import json

In [2]:
## Mounting Google Drive in the Colab environment:

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Path to the CSV file
file_path = '/content/drive/My Drive/Python/R5/dataset.csv'

# Read the CSV file into a DataFrame named 'df'
df = pd.read_csv(file_path)

df.head(3)

|   | disc_number | duration_ms | explicit | track_number | track_popularity | track_id | track_name | audio_features.danceability | audio_features.energy | audio_features.key | ... | audio_features.tempo | audio_features.id | audio_features.time_signature | artist_id | artist_name | artist_popularity | album_id | album_name | album_release_date | album_total_tracks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 212600 | False | 1 | 77 | 4WUepByoeqcedHoYhSNHRt | Welcome To New York (Taylor's Version) | 0.757 | 0.61 | 7.0 | ... | 116.998 | 4WUepByoeqcedHoYhSNHRt | 4.0 | 06HL4z0CvFAxyc27GX | Taylor Swift | 120 | 1o59UpKw81iHR0HPiSkJR0 | 1989 (Taylor's Version) [Deluxe] | 2023-10-27 | 22 |
| 1 | 1 | 231833 | False | 2 | 78 | 0108kcWLnn2HlH2kedi1gn | Blank Space (Taylor's Version) | 0.733 | 0.733 | 0.0 | ... | 96.057 | 0108kcWLnn2HlH2kedi1gn | 4.0 | 06HL4z0CvFAxyc27GX | Taylor Swift | 120 | 1o59UpKw81iHR0HPiSkJR0 | 1989 (Taylor's Version) [Deluxe] | 2023-10-27 | 22 |
| 2 | 1 | 231000 | False | 3 | 79 | 3Vpk1hfMAQme8VJ0SNRSkd | Style (Taylor's Version) | 0.511 | 0.822 | 11.0 | ... | 94.868 | 3Vpk1hfMAQme8VJ0SNRSkd | 4.0 | 06HL4z0CvFAxyc27GX | Taylor Swift | 120 | 1o59UpKw81iHR0HPiSkJR0 | 1989 (Taylor's Version) [Deluxe] | 2023-10-27 | 22 |


# 1. Completeness: Checking for any null or missing values

#### Analysis

In [14]:
## Calculating the percentage of missing values for each column
total_rows = len(df)
percentage_missing = (df.isnull().sum() / total_rows) * 100

## Creating a new DataFrame with both the count and percentage of missing values
completeness_report = pd.DataFrame({
    'Missing Values': df.isnull().sum(),
    'Percentage of Missing Values': percentage_missing
})

completeness_report

| Column | Missing Values | Percentage of Missing Values |
|---|---|---|
| disc_number | 0 | 0.0 |
| duration_ms | 0 | 0.0 |
| explicit | 0 | 0.0 |
| track_number | 0 | 0.0 |
| track_popularity | 0 | 0.0 |
| track_id | 8 | 1.48423 |
| track_name | 7 | 1.298701 |
| audio_features.danceability | 2 | 0.371058 |
| audio_features.energy | 2 | 0.371058 |
| audio_features.key | 1 | 0.185529 |


In [13]:
## Generate the report
print("The dataset has missing values in several columns:\n")
for index, row in completeness_report[completeness_report['Missing Values'] > 0].iterrows():
    print(f"\t{index}: {int(row['Missing Values'])} missing values ({row['Percentage of Missing Values']:.2f}%)")

The dataset has missing values in several columns:

	track_id: 8 missing values (1.48%)
	track_name: 7 missing values (1.30%)
	audio_features.danceability: 2 missing values (0.37%)
	audio_features.energy: 2 missing values (0.37%)
	audio_features.key: 1 missing values (0.19%)
	audio_features.loudness: 2 missing values (0.37%)
	audio_features.speechiness: 1 missing values (0.19%)
	audio_features.acousticness: 1 missing values (0.19%)
	audio_features.liveness: 1 missing values (0.19%)
	audio_features.tempo: 1 missing values (0.19%)
	audio_features.time_signature: 1 missing values (0.19%)
	album_name: 62 missing values (11.50%)


#### Report

The dataset has 539 rows, and 12 columns contain missing values. **album_name** is the most affected (11.50% of rows); every other affected column is missing fewer than 1.5% of its values.

*   track_id: 8 missing values (1.48%)
*   track_name: 7 missing values (1.30%)
*   audio_features.danceability: 2 missing values (0.37%)
*   audio_features.energy: 2 missing values (0.37%)
*   audio_features.key: 1 missing value (0.19%)
*   audio_features.loudness: 2 missing values (0.37%)
*   audio_features.speechiness: 1 missing value (0.19%)
*   audio_features.acousticness: 1 missing value (0.19%)
*   audio_features.liveness: 1 missing value (0.19%)
*   audio_features.tempo: 1 missing value (0.19%)
*   audio_features.time_signature: 1 missing value (0.19%)
*   album_name: 62 missing values (11.50%)
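
The completeness check above can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, not the notebook's own method: the function name `missing_value_summary` and the sample rows are hypothetical and are not taken from *dataset.csv*. It computes the null counts once, keeps only the affected columns, and sorts them so the most affected column comes first.

```python
import pandas as pd


def missing_value_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    missing = frame.isnull().sum()  # computed once and reused below
    summary = pd.DataFrame({
        "Missing Values": missing,
        "Percentage of Missing Values": missing / len(frame) * 100,
    })
    # Keep only the affected columns, most affected first
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("Missing Values", ascending=False)


# Hypothetical sample data, not taken from dataset.csv
sample = pd.DataFrame({
    "track_id": ["a1", None, "c3", "d4"],
    "track_name": ["Song A", "Song B", None, None],
    "duration_ms": [200000, 210000, 220000, 230000],
})

summary = missing_value_summary(sample)
print(summary)
```

On the real data the same call would be `missing_value_summary(df)`. Note that `isnull()` only detects true nulls; placeholders such as empty strings would need a separate check.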

# 2. Consistency: Examine the dataset for any inconsistencies (**Logical** inconsistencies in this case.)

#### Analysis

In [17]:

# Calculate the number of unique values in each column
unique_value_counts = df.nunique()

print(unique_value_counts)

disc_number                          2
duration_ms                        364
explicit                             4
track_number                        46
track_popularity                    73
track_id                           512
track_name                         331
audio_features.danceability        267
audio_features.energy              348
audio_features.key                  12
audio_features.loudness            448
audio_features.mode                  2
audio_features.speechiness         292
audio_features.acousticness        401
audio_features.instrumentalness    240
audio_features.liveness            271
audio_features.valence             326
audio_features.tempo               450
audio_features.id                  519
audio_features.time_signature        3
artist_id                            1
artist_name                          1
artist_popularity                    1
album_id                            26
album_name                          24
album_release_date       

In [24]:
# Apply value_counts() to each column
for column in df.columns:
    counts = df[column].value_counts()
    # Show all values and counts
    # with pd.option_context('display.max_rows', None):
    print(f"Value counts for {column}:\n{counts}\n\n\n")

Value counts for disc_number:
1    522
2     17
Name: disc_number, dtype: int64



Value counts for duration_ms:
231000    6
212600    5
247533    5
235800    5
271000    4
         ..
290040    1
214373    1
221800    1
199733    1
179066    1
Name: duration_ms, Length: 364, dtype: int64



Value counts for explicit:
False    480
True      54
No         4
Si         1
Name: explicit, dtype: int64



Value counts for track_number:
8     29
1     28
6     28
7     28
2     28
5     28
3     28
4     28
11    27
12    27
13    27
10    27
9     27
14    25
15    24
16    21
17    18
18    14
19    12
20    10
21    10
22     8
23     4
26     3
24     3
25     3
28     2
29     2
30     2
27     2
39     1
45     1
44     1
43     1
42     1
41     1
40     1
36     1
38     1
37     1
35     1
34     1
33     1
32     1
31     1
46     1
Name: track_number, dtype: int64



Value counts for track_popularity:
 70    24
 72    20
 82    18
 80    18
 76    17
       ..
-70     1
-92     1


In [21]:
track_popularity_value_counts = df['track_popularity'].value_counts().reset_index()
track_popularity_value_counts.columns = ['Valor', 'Recuento']

# Show all values and counts
with pd.option_context('display.max_rows', None):
    print(track_popularity_value_counts)

    Valor  Recuento
0      70        24
1      72        20
2      82        18
3      80        18
4      76        17
5      78        17
6      68        16
7      71        16
8      74        15
9      77        15
10     60        15
11     73        13
12     69        12
13     84        12
14     83        11
15     49        11
16     47        11
17     36        11
18     58        10
19     66        10
20     33        10
21     81        10
22     75        10
23     67         9
24     50         9
25     53         9
26     37         9
27     79         8
28     51         8
29     56         8
30     48         8
31     85         8
32     34         7
33     59         7
34     61         7
35     55         6
36     87         6
37     46         6
38     65         6
39     43         5
40     40         5
41     42         5
42     35         5
43     54         5
44     64         5
45     86         5
46     62         4
47     44         4
48     52         4


#### Report

*   The **explicit** column contains four distinct values: 'False' (480 rows), 'True' (54), 'No' (4) and 'Si' (1). A boolean field should hold only two values, so mixing the English booleans with the Spanish labels 'Si' and 'No' is an inconsistency that affects 5 rows.

*   The valid range for the **track_popularity** field is presumably the integers from 0 to 100. If so, every value outside that range is inconsistent. The value counts include negative values such as -70 and -92.
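
The two findings above can be turned into row-level filters. The following is a minimal sketch under two assumptions that are stated in the report but not confirmed by an external specification: that `explicit` should only contain the labels `True` and `False`, and that `track_popularity` should lie between 0 and 100 inclusive. The sample rows and the name `VALID_EXPLICIT` are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows that mimic the anomalies described above
sample = pd.DataFrame({
    "explicit": ["False", "True", "No", "Si", "False"],
    "track_popularity": [70, -92, 82, 101, 0],
})

# Assumption: only these two labels are valid for the boolean field
VALID_EXPLICIT = {"True", "False"}

# Cast to string so real booleans and text labels are compared the same way
bad_explicit = sample[~sample["explicit"].astype(str).isin(VALID_EXPLICIT)]

# Assumption: popularity is a score between 0 and 100 inclusive
bad_popularity = sample[~sample["track_popularity"].between(0, 100)]

print(bad_explicit)
print(bad_popularity)
```

Applied to `df` instead of `sample`, the two filtered frames would list the offending rows, which makes it easier to justify each anomaly in the report.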

# 3. Conformity: Validate that all data is in the correct format and adheres to specific standards.
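
One possible conformity check, shown here only as a sketch, is to verify that a date column follows a single format. It assumes that `album_release_date` is expected to use the `YYYY-MM-DD` format seen in the preview of the data; the sample values below are hypothetical and do not come from *dataset.csv*.

```python
import pandas as pd

# Hypothetical sample values; only the first format appears in the data preview
dates = pd.Series(
    ["2023-10-27", "27/10/2023", "2023-13-01", None],
    name="album_release_date",
)

# Assumption: the expected format is YYYY-MM-DD; anything else becomes NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Values that are present but could not be parsed do not conform
nonconforming = dates[dates.notna() & parsed.isna()]
print(nonconforming)
```

Missing dates are excluded on purpose, because they belong to the completeness dimension rather than to conformity.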

# 4. Accuracy: This dimension is challenging to assess without an external source of truth, but we will look for indicators of inaccuracy.

# 5. Integrity: Assess the integrity of relationships within the dataset.


# 6. Timeliness: Evaluate the relevance and currency of the data, particularly focusing on date fields.
