# Introduction to Data Analysis
Welcome to this introductory course on Data Analysis. In this course, we will explore the fundamental concepts and techniques used in the field of data analysis. Data analysis is a critical skill in various domains, allowing us to make sense of complex data and extract meaningful insights.

## Objectives
- Understand the basic concepts of data analysis.
- Learn how to import and handle different types of data in Python.
- Perform basic data exploration and visualization.

## pandas, numpy, and matplotlib

- pandas:
  - A powerful data manipulation and analysis library for Python.
  - Provides data structures and functions to efficiently handle and analyze structured data.
  - Allows for easy data cleaning, transformation, and exploration.
  - Supports various data formats, including CSV, Excel, SQL databases, and more.
  - Enables data aggregation, filtering, and merging operations.
  - Provides tools for handling missing data and time series data.

- numpy:
  - A fundamental package for scientific computing with Python.
  - Provides support for large, multi-dimensional arrays and matrices.
  - Offers a wide range of mathematical functions for array operations.
  - Enables efficient numerical computations and data manipulation.
  - Supports advanced mathematical operations, such as linear algebra, Fourier transforms, and random number generation.
  - Integrates well with other libraries for data analysis and visualization.

- matplotlib:
  - A comprehensive library for creating static, animated, and interactive visualizations in Python.
  - Provides a wide range of plotting functions and styles for creating various types of plots, such as line plots, scatter plots, bar plots, histograms, and more.
  - Offers fine-grained control over plot elements, including axes, labels, legends, and annotations.
  - Supports customization of plot appearance, including colors, markers, and line styles.
  - Allows for the creation of complex visualizations, including subplots, grids, and 3D plots.
  - Integrates well with other libraries, such as pandas and numpy, for seamless data visualization.
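
As a rough illustration of how the libraries listed above fit together, here is a minimal sketch that wraps a numpy array in a pandas DataFrame, fills a missing value, and applies a numpy function to a column. The data and the column names (`x`, `y`, `log_x`) are made up for this example, and plotting with matplotlib is left out to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: 4 samples x 2 features, with one missing value.
values = np.array([[1.0, 10.0],
                   [2.0, np.nan],
                   [3.0, 30.0],
                   [4.0, 40.0]])

# pandas wraps the numpy array and adds labelled columns.
df = pd.DataFrame(values, columns=["x", "y"])

# Missing-data handling: replace the NaN in "y" with the column mean.
df["y"] = df["y"].fillna(df["y"].mean())

# numpy functions operate directly on pandas columns.
df["log_x"] = np.log(df["x"])

# Aggregation: column-wise means, returned as a pandas Series.
column_means = df.mean()
print(column_means)
```

Filling with the mean is only one possible strategy; whether it is appropriate depends on the data.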

# Data Formats

When you're engaged in data analysis, you'll encounter a variety of data formats. Each format has its unique structure and use cases. Understanding these formats is crucial for effective data manipulation and analysis. Here are some common data formats you might encounter:

1. **CSV (Comma-Separated Values)**
   - CSV files are simple, text-based files where data is separated by commas (or sometimes other delimiters like semicolons).
   - They are widely used due to their simplicity and human-readable format.
   - CSV files are excellent for tabular data and are compatible with most data analysis tools and spreadsheet applications.

2. **JSON (JavaScript Object Notation)**
   - JSON is a lightweight data-interchange format. It is easy for humans to read and write and easy for machines to parse and generate.
   - JSON is built on two structures: a collection of name/value pairs (like an object, record, struct, dictionary, hash table, keyed list, or associative array) and an ordered list of values (like an array, vector, list, or sequence).
   - Commonly used in web applications for data exchange and for configurations.

3. **XML (eXtensible Markup Language)**
   - XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
   - It is used to describe data, and its structure allows for a hierarchical organization of information.
   - XML is often used in web services and configuration files.

4. **Excel Formats (.xls, .xlsx)**
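   - These are the file formats used by Microsoft Excel: `.xls` is the legacy binary format, while `.xlsx` is the newer format, a ZIP archive of XML files (Office Open XML).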
   - Excel files can contain multiple sheets and are capable of storing large amounts of data with features like formulas, charts, and macros.
   - Common in business and academia for handling complex datasets with multiple attributes.

5. **SQL Database Formats**
   - SQL databases store data in tables, which can be queried using SQL (Structured Query Language).
   - Common SQL databases include MySQL, PostgreSQL, SQLite, and others.
   - Ideal for structured data and widely used in applications requiring frequent data retrieval and updates.

Each of these formats has its specific advantages, and the choice of format often depends on the nature of the data and the requirements of the analysis task.
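
To make the comparison concrete, the sketch below stores the same two records as CSV text, as JSON text, and as a table in an in-memory SQLite database, then loads each into a DataFrame. The records, the table name `people`, and the column names are hypothetical and chosen only for illustration.

```python
import io
import json
import sqlite3

import pandas as pd

# Hypothetical records used only for illustration.
records = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": 45},
]

# CSV: one header line, then one comma-separated line per record.
csv_text = "name,age\nAda,36\nGrace,45\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: an array of objects, each a collection of name/value pairs.
json_text = json.dumps(records)
from_json = pd.read_json(io.StringIO(json_text))

# SQL: a table in an in-memory SQLite database, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(r["name"], r["age"]) for r in records],
)
from_sql = pd.read_sql_query("SELECT name, age FROM people", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```

All three DataFrames should hold the same rows; only the storage format differs.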

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Datasets from seaborn

### Iris
The Iris dataset is a popular dataset in the field of machine learning and data analysis. It contains measurements of four features (sepal length, sepal width, petal length, and petal width) for three different species of Iris flowers (setosa, versicolor, and virginica). The dataset is often used for classification tasks because its classes are relatively easy to tell apart: setosa is clearly separated from the other two species, while versicolor and virginica overlap slightly.

### Titanic
The Titanic dataset is another commonly used dataset in data analysis and machine learning. It contains information about the passengers aboard the Titanic, including their age, sex, passenger class, and survival status. This dataset is often used for predictive modeling tasks, as it allows us to explore factors that may have influenced the survival of passengers during the Titanic disaster.

In [22]:
# Import necessary libraries
import pandas as pd
import seaborn as sns

# Importing datasets
# Iris Dataset
iris = sns.load_dataset('iris')

# Titanic Dataset
titanic = sns.load_dataset('titanic')

# Display a message confirming the datasets are loaded
print("Datasets loaded successfully")

Datasets loaded successfully


In [23]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [24]:
titanic.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


In [25]:
iris.info()

print("")

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   cl

In [6]:
print(iris.describe())

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [7]:
print(titanic.describe())

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
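
In the Titanic summary above, `age` has a count of 714 while the other columns have 891, and text columns such as `sex` do not appear at all. The following sketch reproduces both effects on a small, made-up DataFrame (the values are hypothetical): by default `describe()` summarizes numeric columns only, and `count` excludes missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger-like data with one missing age.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # also summarizes "sex"
missing_per_column = df.isna().sum()        # number of missing values per column

print(numeric_summary.loc["count"])
print(missing_per_column)
```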


### Exercise: Find the source repositories of these datasets for extra information

## Reading different data formats with pandas
Here are some common pandas functions for reading different data formats:

- **pd.read_csv()** - Read a comma-separated values (CSV) file into DataFrame.
- **pd.read_json()** - Convert a JSON string to pandas object.
- **pd.read_xml()** - Read XML file(s) into DataFrame.
- **pd.read_excel()** - Read an Excel file into a pandas DataFrame.

Let's see some examples.
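
As a first sketch, here is `pd.read_csv` applied to semicolon-delimited text. The text is held in memory with `io.StringIO` instead of a file on disk, and its layout is an assumption made for this example; the two student rows reuse values from the outputs shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk.
csv_text = "Student_ID;Age;GPA\nS001;18;3.11\nS002;23;3.57\n"

# sep selects the delimiter; index_col uses a column as the row index.
df_csv = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="Student_ID")
print(df_csv)
```

With a real file, the first argument would simply be the file path.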

### JSON

In [26]:
# import a json file and read it with pandas
import json
with open('students_data.json') as f:
    data = json.load(f)

In [27]:
# convert to a pandas DataFrame

df = pd.DataFrame(data)
df.head()

Unnamed: 0,Student_ID,Age,GPA
0,S001,18,3.11
1,S002,23,3.57
2,S003,25,2.23
3,S030,20,2.7


In [28]:
# or read directly with pandas
df = pd.read_json('students_data.json')
df.head()

Unnamed: 0,Student_ID,Age,GPA
0,S001,18,3.11
1,S002,23,3.57
2,S003,25,2.23
3,S030,20,2.7


### XML

In [19]:
# open students_data.xml with pandas

df = pd.read_xml('students_data.xml')
df.head()

Unnamed: 0,Student_ID,Age,GPA
0,S001,18,3.11
1,S002,23,3.57
2,S003,25,2.23
3,S004,24,3.64
4,S005,18,2.37
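
The contents of `students_data.xml` are not shown here, so the sketch below assumes one plausible layout: a root element whose children each describe one student. The element names (`students`, `student`) are hypothetical. By default, `pd.read_xml` turns each child of the root into a row and each of its sub-elements into a column; `parser="etree"` uses Python's built-in XML parser, so `lxml` is not required.

```python
import io

import pandas as pd

# Hypothetical XML layout: one <student> element per row.
xml_text = """<students>
  <student><Student_ID>S001</Student_ID><Age>18</Age><GPA>3.11</GPA></student>
  <student><Student_ID>S002</Student_ID><Age>23</Age><GPA>3.57</GPA></student>
</students>"""

# Each child of the root becomes a row; its sub-elements become columns.
df_xml = pd.read_xml(io.StringIO(xml_text), parser="etree")
print(df_xml)
```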


### Excel

In [18]:
# read a .xlsx file with pandas

df = pd.read_excel('students_data.xlsx')
df.head()


Unnamed: 0,Student_ID,Age,GPA
0,S001,18,3.11
1,S002,23,3.57
2,S003,25,2.23
3,S004,24,3.64
4,S005,18,2.37


## How to access the data

Now let's see how to access specific data in a dataset. We will use the Iris dataset as an example.

In [29]:
# read the first column of the iris dataframe into a series
sepal_length_series = iris['sepal_length']
sepal_length_series

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [30]:
# what is its type?
print(type(sepal_length_series))

<class 'pandas.core.series.Series'>


In [31]:
# let's make it a list type
iris_list = sepal_length_series.tolist()
iris_list

[5.1,
 4.9,
 4.7,
 4.6,
 5.0,
 5.4,
 4.6,
 5.0,
 4.4,
 4.9,
 5.4,
 4.8,
 4.8,
 4.3,
 5.8,
 5.7,
 5.4,
 5.1,
 5.7,
 5.1,
 5.4,
 5.1,
 4.6,
 5.1,
 4.8,
 5.0,
 5.0,
 5.2,
 5.2,
 4.7,
 4.8,
 5.4,
 5.2,
 5.5,
 4.9,
 5.0,
 5.5,
 4.9,
 4.4,
 5.1,
 5.0,
 4.5,
 4.4,
 5.0,
 5.1,
 4.8,
 5.1,
 4.6,
 5.3,
 5.0,
 7.0,
 6.4,
 6.9,
 5.5,
 6.5,
 5.7,
 6.3,
 4.9,
 6.6,
 5.2,
 5.0,
 5.9,
 6.0,
 6.1,
 5.6,
 6.7,
 5.6,
 5.8,
 6.2,
 5.6,
 5.9,
 6.1,
 6.3,
 6.1,
 6.4,
 6.6,
 6.8,
 6.7,
 6.0,
 5.7,
 5.5,
 5.5,
 5.8,
 6.0,
 5.4,
 6.0,
 6.7,
 6.3,
 5.6,
 5.5,
 5.5,
 6.1,
 5.8,
 5.0,
 5.6,
 5.7,
 5.7,
 6.2,
 5.1,
 5.7,
 6.3,
 5.8,
 7.1,
 6.3,
 6.5,
 7.6,
 4.9,
 7.3,
 6.7,
 7.2,
 6.5,
 6.4,
 6.8,
 5.7,
 5.8,
 6.4,
 6.5,
 7.7,
 7.7,
 6.0,
 6.9,
 5.6,
 7.7,
 6.3,
 6.7,
 7.2,
 6.2,
 6.1,
 6.4,
 7.2,
 7.4,
 7.9,
 6.4,
 6.3,
 6.1,
 7.7,
 6.3,
 6.4,
 6.0,
 6.9,
 6.7,
 6.9,
 5.8,
 6.8,
 6.7,
 6.7,
 6.3,
 6.5,
 6.2,
 5.9]

In [35]:
iris_list[:5]  # first five values (not displayed: a cell only shows its last expression)

# let's make it a numpy array
iris_array = np.array(iris_list)
iris_array

array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
       4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
       5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
       5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. , 7. , 6.4,
       6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. , 6.1, 5.6,
       6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7,
       6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6, 5.5, 5.5,
       6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3,
       6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5,
       7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
       7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6. , 6.9, 6.7, 6.9, 5.8,
       6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9])
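
Going through a Python list is not required: a Series can be converted to a numpy array directly with `to_numpy()`. The sketch below uses a short stand-in Series (the first five sepal lengths shown above) rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for iris['sepal_length'], using its first five values.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="sepal_length")

as_array = sepal_length.to_numpy()            # Series -> numpy array, no list in between
first_three = sepal_length.head(3).tolist()   # Series -> plain Python list

print(type(as_array).__name__, as_array.dtype)
print(first_three)
```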

In [36]:
# let's read the data of the first row
# iris.iloc[i] returns the ith row of the dataframe (by position)

iris.iloc[0]

sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object

In [38]:
# and how do I get the data given a row and a column?
# iris.loc[i, col] returns the value at row label i and column label col
type(iris.loc[0, 'sepal_length'])


numpy.float64
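
With the default index (0, 1, 2, ...), `loc` and `iloc` appear interchangeable for rows, which hides the difference between them: `loc` selects by label, `iloc` by integer position. The sketch below makes the difference visible with a small, hypothetical DataFrame whose index labels are 10, 20, and 30.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from row positions.
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3},
    index=[10, 20, 30],
)

by_position = df.iloc[0]                      # first row, whatever its label
by_label = df.loc[10]                         # row whose index label is 10
cell_by_label = df.loc[20, "sepal_length"]    # row label 20, column label
cell_by_position = df.iloc[1, 0]              # second row, first column

print(cell_by_label, cell_by_position)
```

Here `df.loc[0]` would raise a `KeyError`, because no row carries the label 0.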

# Taylor Swift's Discography
Let's explore the discography of Taylor Swift, one of the most popular singers of all time.
You can download the dataset and find further information about it on Kaggle:
https://www.kaggle.com/datasets/jarredpriester/taylor-swift-spotify-dataset?resource=download

Let's read the data into a pandas DataFrame using the `read_csv` function:

<details>
<summary>Click to expand!</summary>

```python
# open taylor_swift_spotify.csv
taylor_swift = pd.read_csv('taylor_swift_spotify.csv', index_col=0)
```

</details>

In [39]:
# add your code here
taylor_swift = pd.read_csv('taylor_swift_spotify.csv', index_col=0)

In [40]:
taylor_swift

Unnamed: 0,name,album,release_date,track_number,id,uri,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms
0,Welcome To New York (Taylor's Version),1989 (Taylor's Version) [Deluxe],2023-10-27,1,4WUepByoeqcedHoYhSNHRt,spotify:track:4WUepByoeqcedHoYhSNHRt,0.009420,0.757,0.610,0.000037,0.3670,-4.840,0.0327,116.998,0.685,72,212600
1,Blank Space (Taylor's Version),1989 (Taylor's Version) [Deluxe],2023-10-27,2,0108kcWLnn2HlH2kedi1gn,spotify:track:0108kcWLnn2HlH2kedi1gn,0.088500,0.733,0.733,0.000000,0.1680,-5.376,0.0670,96.057,0.701,73,231833
2,Style (Taylor's Version),1989 (Taylor's Version) [Deluxe],2023-10-27,3,3Vpk1hfMAQme8VJ0SNRSkd,spotify:track:3Vpk1hfMAQme8VJ0SNRSkd,0.000421,0.511,0.822,0.019700,0.0899,-4.785,0.0397,94.868,0.305,74,231000
3,Out Of The Woods (Taylor's Version),1989 (Taylor's Version) [Deluxe],2023-10-27,4,1OcSfkeCg9hRC2sFKB4IMJ,spotify:track:1OcSfkeCg9hRC2sFKB4IMJ,0.000537,0.545,0.885,0.000056,0.3850,-5.968,0.0447,92.021,0.206,73,235800
4,All You Had To Do Was Stay (Taylor's Version),1989 (Taylor's Version) [Deluxe],2023-10-27,5,2k0ZEeAqzvYMcx9Qt5aClQ,spotify:track:2k0ZEeAqzvYMcx9Qt5aClQ,0.000656,0.588,0.721,0.000000,0.1310,-5.579,0.0317,96.997,0.520,72,193289
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
525,Our Song,Taylor Swift,2006-10-24,11,15DeqWWQB4dcEWzJg15VrN,spotify:track:15DeqWWQB4dcEWzJg15VrN,0.111000,0.668,0.672,0.000000,0.3290,-4.931,0.0303,89.011,0.539,74,201106
526,I'm Only Me When I'm With You,Taylor Swift,2006-10-24,12,0JIdBrXGSJXS72zjF9ss9u,spotify:track:0JIdBrXGSJXS72zjF9ss9u,0.004520,0.563,0.934,0.000807,0.1030,-3.629,0.0646,143.964,0.518,59,213053
527,Invisible,Taylor Swift,2006-10-24,13,5OOd01o2YS1QFwdpVLds3r,spotify:track:5OOd01o2YS1QFwdpVLds3r,0.637000,0.612,0.394,0.000000,0.1470,-5.723,0.0243,96.001,0.233,56,203226
528,A Perfectly Good Heart,Taylor Swift,2006-10-24,14,1spLfUJxtyVyiKKTegQ2r4,spotify:track:1spLfUJxtyVyiKKTegQ2r4,0.003490,0.483,0.751,0.000000,0.1280,-5.726,0.0365,156.092,0.268,54,220146


#### Let's discuss the dataset

- Which pieces of information are available?
- What is the meaning of each column?
- Are any songs repeated?
- Are there any missing values?
- What is the data type of the date and time columns?
- Are all the columns necessary?
- Is the dataset reliable? How can we check it?
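
Several of these questions can be explored with a few pandas calls. The sketch below is one possible starting point, run on a tiny, made-up table that mimics four of the dataset's columns; the song and album names are placeholders, not real data. The same calls should work on `taylor_swift`.

```python
import io

import pandas as pd

# Hypothetical rows mimicking a few columns of the dataset; not real data.
csv_text = """name,album,release_date,duration_ms
Song A,Album X,2020-01-01,200000
Song A,Album X (Deluxe),2021-01-01,200500
Song B,Album X,2020-01-01,
"""
songs = pd.read_csv(io.StringIO(csv_text))

print(songs.dtypes)                                      # data type of each column
missing = songs.isna().sum()                             # missing values per column
repeated = songs[songs.duplicated("name", keep=False)]   # songs listed more than once

# Dates are read as text unless parsed; convert them explicitly.
songs["release_date"] = pd.to_datetime(songs["release_date"])

# Durations in minutes are easier to read than milliseconds.
songs["duration_min"] = songs["duration_ms"] / 60_000

print(missing)
print(repeated)
```

Matching on `name` alone is a simplification: re-recordings and deluxe editions may carry slightly different titles, so checking for repeats in the real dataset may need extra text cleaning.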