# Lab 1.02 - Android Persistence

Import the necessary Python libraries and load the dataset [android_persistence_cpu.csv](https://github.com/HoGentTIN/dsai-en-labs/blob/main/data/android_persistence_cpu.csv) into a variable `android_persistence`. See the [code book](https://github.com/HoGentTIN/dsai-en-labs/blob/main/data/android_persistence_cpu.md) for more information on its contents. Note that this file is not stored as a regular CSV file! You may need to tweak the parameters of the import function to load it correctly.
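
As a minimal sketch of why the import parameters matter, the snippet below parses a tiny in-memory sample twice: once with the default comma separator and once with `sep=";"`. The sample rows are illustrative and only mimic the layout of the lab file; inspect the raw file yourself to confirm which separator it uses.

```python
import io

import pandas as pd

# Illustrative sample (not the real dataset), assumed to be semicolon-separated.
raw = (
    "Time;PersistenceType;DataSize\n"
    "1.81;Sharedpreferences;Small\n"
    "1.35;Sharedpreferences;Small\n"
)

# Default separator (comma): each line is read as one single column.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# Explicit semicolon separator: three separate, properly typed columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
print(right.shape)
print(right.dtypes)
```

A data frame with a single column whose name contains the separator character is a typical sign that the wrong delimiter was used.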

In [6]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation


Explore the dataset:

- How many variables and observations are present in the dataset?
- What is the level of measurement of each variable?
- Perform the conversion of the qualitative variables to the appropriate type (and specify the order of ordinal variables).
- List the data types in the dataset.

In [13]:
android = pd.read_csv("https://raw.githubusercontent.com/HoGentTIN/dsai-labs/main/data/android_persistence_cpu.csv", delimiter=";")
android.head()

   Time    PersistenceType  DataSize
0  1.81  Sharedpreferences     Small
1  1.35  Sharedpreferences     Small
2  1.84  Sharedpreferences     Small
3  1.54  Sharedpreferences     Small
4  1.81  Sharedpreferences     Small


In [23]:
print("Dataset information:")
android.info()
print("*"*50)
# How many variables and observations are present in the dataset?
num_variables = android.shape[1]
num_observations = android.shape[0]
print(f"\nNumber of variables: {num_variables}")
print(f"Number of observations: {num_observations}")

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Time             300 non-null    float64
 1   PersistenceType  300 non-null    object 
 2   DataSize         300 non-null    object 
dtypes: float64(1), object(2)
memory usage: 7.2+ KB
**************************************************

Number of variables: 3
Number of observations: 300


In [26]:
# What is the level of measurement of each variable?
levels_of_measurement = {
    'Time': 'Ratio',
    'PersistenceType': 'Nominal',
    'DataSize': 'Ordinal',
}

print("\nLevels of measurement:")
for variable, level in levels_of_measurement.items():
    print(f"{variable}: {level}")


Levels of measurement:
Time: Ratio
PersistenceType: Nominal
DataSize: Ordinal


In [30]:
# Perform the conversion of the qualitative variables to the appropriate type (and specify the order of ordinal variables).
android['PersistenceType'] = android['PersistenceType'].astype("category")
android['DataSize'] = pd.Categorical(android['DataSize'], categories=['Small', 'Medium', 'Large'], ordered=True)

# List the data types in the dataset.
print(android.dtypes)

Time                float64
PersistenceType    category
DataSize           category
dtype: object
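
To illustrate what specifying the order of an ordinal variable buys you, here is a minimal sketch on a small made-up series (not the lab data). It assumes the order `Small < Medium < Large`, as used in the conversion above.

```python
import pandas as pd

# Hypothetical sample values, only for illustration.
sizes = pd.Series(["Small", "Large", "Medium", "Small"])

# Declare the categories and their order explicitly.
size_type = pd.CategoricalDtype(categories=["Small", "Medium", "Large"], ordered=True)
ordered_sizes = sizes.astype(size_type)

# Sorting follows the declared order rather than alphabetical order.
print(ordered_sizes.sort_values().tolist())

# Comparisons and min/max are only defined for ordered categories.
print((ordered_sizes > "Small").tolist())
print(ordered_sizes.min(), ordered_sizes.max())
```

With an unordered category (such as `PersistenceType`), the comparison and the `min()`/`max()` calls would raise a `TypeError`, which matches the idea that nominal values have no ranking.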


Describe each variable.

In [45]:
print(android["Time"].describe())
print("*"*50)
print(android["DataSize"].describe())
print("*"*50)
print(android["PersistenceType"].describe())

count    300.000000
mean       6.230833
std        4.229599
min        1.090000
25%        1.790000
50%        6.185000
75%       10.662500
max       13.560000
Name: Time, dtype: float64
**************************************************
count       300
unique        3
top       Small
freq        120
Name: DataSize, dtype: object
**************************************************
count          300
unique           4
top       GreenDAO
freq            90
Name: PersistenceType, dtype: object


What unique values are there for the variables `PersistenceType` and `DataSize`? How often does each value occur?

In [15]:
# Find unique values
unique_persistence_types = android["PersistenceType"].unique()
unique_data_sizes = android["DataSize"].unique()

# Find frequency of each unique value
persistence_type_counts = android["PersistenceType"].value_counts()
data_size_counts = android["DataSize"].value_counts()

# Display results
print("Unique values for PersistenceType:", unique_persistence_types)
print("Frequency of each PersistenceType value:\n", persistence_type_counts)
print("Unique values for DataSize:", unique_data_sizes)
print("Frequency of each DataSize value:\n", data_size_counts)



Unique values for PersistenceType: ['Sharedpreferences' 'GreenDAO' 'SQLLite' 'Realm']
Frequency of each PersistenceType value:
 PersistenceType
GreenDAO             90
SQLLite              90
Realm                90
Sharedpreferences    30
Name: count, dtype: int64
Unique values for DataSize: ['Small' 'Medium' 'Large']
Frequency of each DataSize value:
 DataSize
Small     120
Medium     90
Large      90
Name: count, dtype: int64


In this dataset, it is especially interesting to know how often each unique combination of `PersistenceType` and `DataSize` occurs. Figure out how to use the pandas function `crosstab()` to create a so-called contingency table for these variables. This concept returns in Module 4 (examining the relationship between two qualitative variables).

In [17]:
contingency_table = pd.crosstab(android["PersistenceType"], android["DataSize"])
# Display the result
print(contingency_table)

DataSize           Large  Medium  Small
PersistenceType                        
GreenDAO              30      30     30
Realm                 30      30     30
SQLLite               30      30     30
Sharedpreferences      0       0     30
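
The table above lists the `DataSize` columns alphabetically (Large, Medium, Small). As a minimal sketch, the snippet below rebuilds the 300 observations from the counts in that table and shows one possible way to put the columns in their natural order, by reindexing the result. Row totals are computed separately. The reconstructed data frame `df` is an illustration, not the original file.

```python
import pandas as pd

# Rebuild (PersistenceType, DataSize) pairs from the counts in the table above.
rows = [
    (ptype, size)
    for ptype in ["GreenDAO", "Realm", "SQLLite"]
    for size in ["Small", "Medium", "Large"]
    for _ in range(30)
]
rows += [("Sharedpreferences", "Small")] * 30
df = pd.DataFrame(rows, columns=["PersistenceType", "DataSize"])

# Contingency table with the columns in their natural (ordinal) order.
table = pd.crosstab(df["PersistenceType"], df["DataSize"])
table = table.reindex(columns=["Small", "Medium", "Large"])
print(table)

# Number of observations per persistence type.
print(table.sum(axis=1))
```

The zeros in the `Sharedpreferences` row make it visible that this persistence type was only measured for small data sizes.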
