In [21]:
import pandas as pd

path = '../data/star_classification.csv'
dataset = pd.read_csv(path)

`head()` - The most basic summary of a DataFrame's contents. It returns the first `n` rows of the DataFrame.
`tail()` - The counterpart of `head()`. It returns the last `n` rows of the DataFrame.
- The default value of `n` is 5 for both methods.

In [22]:
dataset.head()

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type,Star color,Spectral Class
0,3068,0.0024,0.17,16.12,0,Red,M
1,3042,0.0005,0.1542,16.6,0,Red,M
2,2600,0.0003,0.102,18.7,0,Red,M
3,2800,0.0002,0.16,16.65,0,Red,M
4,1939,0.000138,0.103,20.06,0,Red,M
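As a minimal sketch of the `n` argument, the snippet below calls `head()` and `tail()` on a small stand-in frame. The frame `sample` is hypothetical and only copies a few values from the rows printed above; it is not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in frame built from the first rows printed above.
sample = pd.DataFrame({
    "Temperature (K)": [3068, 3042, 2600, 2800, 1939],
    "Spectral Class": ["M", "M", "M", "M", "M"],
})

first_two = sample.head(2)     # first 2 rows
last_two = sample.tail(2)      # last 2 rows
default_head = sample.head()   # n defaults to 5

print(first_two)
print(last_two)
```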


`info()` shows the number of non-null values in each column, and the data type of each column.

In [23]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Temperature (K)         240 non-null    int64  
 1   Luminosity(L/Lo)        240 non-null    float64
 2   Radius(R/Ro)            240 non-null    float64
 3   Absolute magnitude(Mv)  240 non-null    float64
 4   Star type               240 non-null    int64  
 5   Star color              240 non-null    object 
 6   Spectral Class          240 non-null    object 
dtypes: float64(3), int64(2), object(2)
memory usage: 13.2+ KB


`describe()` shows a statistical summary of the data. Called on a DataFrame, it summarizes only the numerical-typed columns (int, float) by default.
- It returns a collection of summary statistics for the column (count, mean, std, min, quartiles, max).
- It excludes NaN values.
- It is type-dependent, meaning it returns a different summary for numerical-typed and string-typed columns.
- For a single column the result is a Series indexed by the name of each statistic, so we can read one statistic by using its name as the index (ex. `dataset['Temperature (K)'].describe()['mean']`). The same value is also available directly through `dataset['Temperature (K)'].mean()`.

In [24]:
dataset['Temperature (K)'].describe()

count      240.000000
mean     10497.462500
std       9552.425037
min       1939.000000
25%       3344.250000
50%       5776.000000
75%      15055.500000
max      40000.000000
Name: Temperature (K), dtype: float64
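Reading a single statistic out of the `describe()` result can be sketched as follows. The Series `temps` is a hypothetical stand-in holding only the five temperatures from the first rows of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for dataset['Temperature (K)'].
temps = pd.Series([3068, 3042, 2600, 2800, 1939], name="Temperature (K)")

summary = temps.describe()            # Series indexed by statistic name
mean_from_summary = summary["mean"]   # look up one statistic by its label
median_from_summary = summary["50%"]  # the 50% quantile is the median
mean_direct = temps.mean()            # same value without describe()

print(mean_from_summary, median_from_summary, mean_direct)
```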

The code below shows what `describe()` returns when applied to a string-typed column.
- It returns the following summary of the column:
    - `count` - The number of non-null values in the column.
    - `unique` - The number of unique values in the column.
    - `top` - The most frequent value in the column.
    - `freq` - The number of times the most frequent value appears in the column.

In [25]:
dataset['Spectral Class'].describe()

count     240
unique      7
top         M
freq      111
Name: Spectral Class, dtype: object

In [26]:
dataset.describe()

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type
count,240.0,240.0,240.0,240.0,240.0
mean,10497.4625,107188.361635,237.157781,4.382396,2.5
std,9552.425037,179432.24494,517.155763,10.532512,1.711394
min,1939.0,8e-05,0.0084,-11.92,0.0
25%,3344.25,0.000865,0.10275,-6.2325,1.0
50%,5776.0,0.0705,0.7625,8.313,2.5
75%,15055.5,198050.0,42.75,13.6975,4.0
max,40000.0,849420.0,1948.5,20.06,5.0
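The DataFrame-level call skips the string-typed columns by default. A minimal sketch of including them, assuming the `include="all"` option of `describe()` and a hypothetical three-row frame `sample`:

```python
import pandas as pd

# Hypothetical stand-in frame with one numerical and one string column.
sample = pd.DataFrame({
    "Temperature (K)": [3068, 3042, 2600],
    "Spectral Class": ["M", "M", "B"],
})

numeric_only = sample.describe()             # default: numerical columns only
everything = sample.describe(include="all")  # numerical and string columns

print(everything)
```

Statistics that do not apply to a column (for example `mean` for `Spectral Class`) show up as NaN in the combined table.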


`value_counts()` - A Series method that returns the number of times each unique value appears in the column, sorted from most to least frequent. It is most useful for string-typed (categorical) columns, but it works on any column.

In [31]:
dataset['Spectral Class'].value_counts()

M    111
B     46
O     40
A     19
F     17
K      6
G      1
Name: Spectral Class, dtype: int64
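A short sketch of two variations on `value_counts()`: relative frequencies through the `normalize=True` option, and counting a numerical column. Both Series here are hypothetical stand-ins, not the real columns.

```python
import pandas as pd

# Hypothetical stand-ins for a string column and a numerical column.
classes = pd.Series(["M", "M", "M", "B", "O", "B"], name="Spectral Class")
star_types = pd.Series([0, 0, 1, 5], name="Star type")

counts = classes.value_counts()                # absolute counts
shares = classes.value_counts(normalize=True)  # fractions that sum to 1
type_counts = star_types.value_counts()        # also works on numbers

print(counts)
print(shares)
```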

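The code below shows the most basic way to transform the values of a specific column of a DataFrame: plain arithmetic on the column, which pandas applies to every value at once. The result is similar to what `map()` and `apply()` produce. Those pandas methods (not to be confused with Python's built-in `map()`) offer more flexibility than this approach, because they accept an arbitrary function.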

In [27]:
radius_mean = dataset['Radius(R/Ro)'].mean()
new_dataset = dataset['Radius(R/Ro)'] - radius_mean
new_dataset.head()

0   -236.987781
1   -237.003581
2   -237.055781
3   -236.997781
4   -237.054781
Name: Radius(R/Ro), dtype: float64
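The subtraction produces a new Series and does not touch the DataFrame. One way to keep the result next to the original column is sketched below, assuming `DataFrame.assign()`; the frame `sample` and the column name `radius_centered` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in holding the first five radius values.
sample = pd.DataFrame({"Radius(R/Ro)": [0.17, 0.1542, 0.102, 0.16, 0.103]})

radius_mean = sample["Radius(R/Ro)"].mean()
centered = sample["Radius(R/Ro)"] - radius_mean  # vectorized subtraction

# assign() returns a new frame with the extra column; `sample` is unchanged.
sample_with_centered = sample.assign(radius_centered=centered)

print(sample_with_centered)
```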

- `apply()` calls a function along a chosen axis of a DataFrame: `axis='index'` (0, the default) passes each column to the function, and `axis='columns'` (1) passes each row.
- The function takes in the whole row or column as a Series, and returns one result for that row or column.


- In the code below, we use a lambda function that takes in each row as a Series (`axis='columns'`). It reads the row's `Radius(R/Ro)` value and returns that value minus `radius_mean`. Because the lambda returns a single number per row, `apply()` collects the results into a Series and not a whole DataFrame.

The lambda is equivalent to this function:
```
def remean(row):
    # returns one number per row, not the modified row
    return row['Radius(R/Ro)'] - radius_mean
```
This is why the result does not display the whole DataFrame with the specified column's values replaced.

In [29]:
new_dataset = dataset.apply(lambda row: row['Radius(R/Ro)'] - radius_mean, axis='columns')
new_dataset.head()

0   -236.987781
1   -237.003581
2   -237.055781
3   -236.997781
4   -237.054781
dtype: float64
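The effect of the `axis` argument can be sketched on a tiny frame. The frame `sample` and its column names `a` and `b` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

sample = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis='index': the function receives each column as a Series.
per_column = sample.apply(lambda col: col.max() - col.min(), axis="index")

# axis='columns': the function receives each row as a Series.
per_row = sample.apply(lambda row: row["a"] + row["b"], axis="columns")

print(per_column)  # one value per column, indexed by column name
print(per_row)     # one value per row, indexed by row label
```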

- If we want the whole DataFrame back with replaced values, the custom function passed to `apply()` must return the `row` itself. The replacement is done inside the function, where only the specific columns are changed.
- `apply()` returns a new DataFrame and, in this example, leaves the original DataFrame unchanged, so the result must be assigned to a variable to be kept. The function itself does modify the `row` it receives, so it is not strictly a pure function. The pandas documentation describes mutating the passed object as unsupported, and working on `row.copy()` avoids relying on it.

In [149]:
star_dataset = pd.read_csv("../data/star_classification.csv")

radius_mean = star_dataset['Radius(R/Ro)'].mean()

def remean(row):
    row['Radius(R/Ro)'] = row['Radius(R/Ro)'] - radius_mean
    return row

star_dataset.apply(remean, axis='columns')


Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type,Star color,Spectral Class
0,3068,0.002400,-236.987781,16.12,0,Red,M
1,3042,0.000500,-237.003581,16.60,0,Red,M
2,2600,0.000300,-237.055781,18.70,0,Red,M
3,2800,0.000200,-236.997781,16.65,0,Red,M
4,1939,0.000138,-237.054781,20.06,0,Red,M
...,...,...,...,...,...,...,...
235,38940,374830.000000,1118.842219,-9.93,5,Blue,O
236,30839,834042.000000,956.842219,-10.63,5,Blue,O
237,8829,537493.000000,1185.842219,-10.73,5,White,A
238,9235,404940.000000,874.842219,-11.23,5,White,A
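A variant that avoids modifying the row passed in is sketched below. The function name `remean_copy` and the three-row frame `sample` are hypothetical; the only change from `remean` is the explicit `row.copy()`.

```python
import pandas as pd

# Hypothetical stand-in frame with one numerical and one string column.
sample = pd.DataFrame({
    "Radius(R/Ro)": [0.17, 0.1542, 0.102],
    "Spectral Class": ["M", "M", "M"],
})
radius_mean = sample["Radius(R/Ro)"].mean()

def remean_copy(row):
    row = row.copy()  # work on a copy instead of the object pandas passed in
    row["Radius(R/Ro)"] = row["Radius(R/Ro)"] - radius_mean
    return row

centered = sample.apply(remean_copy, axis="columns")

print(centered)  # new frame with the radius column replaced
print(sample)    # original values are still in place
```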


- To transform a single Series value by value, we can use `map()`. It is similar to `apply()` but is called on a Series and receives one value at a time. (A Series also has its own `apply()` method.)
- In the example below, we extract the `Radius(R/Ro)` column from the `dataset` DataFrame as a Series and call `map()` on it.
- A lambda function is enough here, since each call only needs to return one transformed value and not a whole row.


In [28]:
new_dataset = dataset['Radius(R/Ro)'].map(lambda x: x - radius_mean)
new_dataset.head()

0   -236.987781
1   -237.003581
2   -237.055781
3   -236.997781
4   -237.054781
Name: Radius(R/Ro), dtype: float64
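For this particular transformation, `Series.map()`, `Series.apply()`, and plain arithmetic give the same numbers. A minimal comparison on a hypothetical three-value Series:

```python
import pandas as pd

# Hypothetical stand-in for dataset['Radius(R/Ro)'].
radius = pd.Series([0.17, 0.1542, 0.102], name="Radius(R/Ro)")
radius_mean = radius.mean()

via_map = radius.map(lambda x: x - radius_mean)      # one value at a time
via_apply = radius.apply(lambda x: x - radius_mean)  # Series.apply, same idea
via_arithmetic = radius - radius_mean                # vectorized

print(via_map)
```

When the operation can be written as plain arithmetic, the vectorized form is usually the shortest; `map()` and `apply()` matter once the logic needs an arbitrary function.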

The code below is an example of `apply()` with a custom function that contains conditional statements. This is particularly useful for encoding the values of a column in a different data type. (ex. in the code below, we convert the values of the column `Spectral Class` from `string` to `int`) Note that the result is a new Series; the `Spectral Class` column in `dataset` itself is unchanged.

In [32]:
def transform_into_numerical_value(row):
    if row['Spectral Class'] == 'M':
        return 0
    elif row['Spectral Class'] == 'B':
        return 1
    elif row['Spectral Class'] == 'O':
        return 2
    elif row['Spectral Class'] == 'A':
        return 3
    elif row['Spectral Class'] == 'F':
        return 4
    elif row['Spectral Class'] == 'K':
        return 5
    elif row['Spectral Class'] == 'G':
        return 6
    # any other value falls through and returns None

new_dataset = dataset.apply(transform_into_numerical_value, axis='columns')
new_dataset.value_counts()

0    111
1     46
2     40
3     19
4     17
5      6
6      1
dtype: int64
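The same encoding can also be sketched with a dictionary passed to `Series.map()`, which replaces the chain of `elif` branches with a lookup table. The Series `spectral` is a hypothetical stand-in and the name `class_codes` is an assumption; the codes mirror the ones used in the function above.

```python
import pandas as pd

# Hypothetical stand-in for dataset['Spectral Class'].
spectral = pd.Series(["M", "B", "O", "A", "F", "K", "G", "M"],
                     name="Spectral Class")

# Same letter-to-number codes as the if/elif chain.
class_codes = {"M": 0, "B": 1, "O": 2, "A": 3, "F": 4, "K": 5, "G": 6}

encoded = spectral.map(class_codes)  # look up each value in the dict

# Values missing from the dict become NaN instead of raising an error.
unknown = pd.Series(["X"]).map(class_codes)

print(encoded)
```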