# Base Python: Read and write files

### 1. Open a txt file named 'Hello world' and write each element of the list x on its own line, then reopen that file and print all of its lines

```python
x = ['Python is flexible','but', 'Python is slow!']
```
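One possible way to approach this exercise is sketched below. The file name `Hello world.txt` is an assumption about how the name in the task maps to a path, and `lines` is a helper variable introduced only for illustration.

```python
from pathlib import Path

x = ['Python is flexible', 'but', 'Python is slow!']

# Assumed file name: the exercise's 'Hello world' with a .txt extension.
path = Path('Hello world.txt')

# Write each list element on its own line; mode 'w' overwrites an existing file.
with path.open('w', encoding='utf-8') as f:
    for item in x:
        f.write(item + '\n')

# Reopen the file and collect every line without its trailing newline.
with path.open('r', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

for line in lines:
    print(line)
```

The `with` statement closes the file even if an error occurs, which is why it is usually preferred over calling `close()` by hand.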

# Pandas is built on NumPy

In [23]:
import pandas as pd

## Pandas can read:

1. CSV (Comma-Separated Values)
    - Function: pd.read_csv()
    - Description: CSV files are simple text files where data is separated by commas (or other delimiters).
    - Example: df = pd.read_csv('data.csv')
2. Excel Files (XLSX/XLS)
    - Function: pd.read_excel()
    - Description: Pandas can read from both .xlsx (Excel 2007+) and .xls (older Excel formats).
    - Example: df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
3. JSON (JavaScript Object Notation)
    - Function: pd.read_json()
    - Description: JSON files can store structured data as key-value pairs, often used in web applications.
    - Example: df = pd.read_json('data.json')
4. SQL Databases
    - Function: pd.read_sql(), pd.read_sql_query(), pd.read_sql_table()
    - Description: Pandas can read data from SQL databases such as SQLite, PostgreSQL, MySQL, etc.
    - Example: df = pd.read_sql_query('SELECT * FROM table_name', connection) for a query; df = pd.read_sql_table('table_name', connection) for a full table (pd.read_sql_table() requires an SQLAlchemy connectable)
5. XML Files
    - Function: pd.read_xml()
    - Description: XML is a markup language commonly used for data exchange. Pandas can read XML and convert it into a DataFrame.
    - Example: df = pd.read_xml('data.xml')
6. Plain Text Files
    - Function: pd.read_fwf() (for fixed-width formatted files), pd.read_csv() with sep= or delimiter= (for delimited text files)
    - Description: Text files with either fixed-width columns or custom delimiters.
    - Example: df = pd.read_fwf('data.txt')
7. Compressed Files
    - Function: pd.read_csv(), pd.read_json(), etc. (through the compression parameter, which defaults to 'infer')
    - Description: Pandas infers the compression from the file extension (.gz, .bz2, .zip, .xz) and decompresses the file while reading it.
    - Example: df = pd.read_csv('data.csv.gz')
      
And many more, such as Parquet (pd.read_parquet()) and HTML tables (pd.read_html()).
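As a small illustration of the readers listed above, the sketch below parses the same records from CSV and JSON text. The data and the names `csv_text`, `json_text`, `df_csv`, and `df_json` are hypothetical, and `io.StringIO` stands in for files on disk so the example runs without any data files.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "name;score\nAda;90\nLinus;85\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Linus", "score": 85}]'

# sep=';' tells read_csv which delimiter separates the fields.
df_csv = pd.read_csv(io.StringIO(csv_text), sep=';')

# read_json accepts a file-like object as well as a path.
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```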

In [24]:
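df = pd.read_csv('Health_Sleep_Statistics.csv')
df.tail(3)   # last 3 rows
df.head(4)   # first 4 rows

df.columns   # column labels
df[['Age', 'User ID']].head(2)   # select columns by name
df.iloc[0:4, 1]   # df.iloc[rows, columns]: rows 0-3 of the second column

df.loc[df['Gender'] == 'f'].head(3)   # first 3 rows where Gender is 'f'

df.loc[df['Bedtime'] == '23:00']   # rows with a 23:00 bedtime
df.describe().round()   # summary statistics of the numeric columns
df   # only the last expression in a cell is displayed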

Unnamed: 0,User ID,Age,Gender,Sleep Quality,Bedtime,Wake-up Time,Daily Steps,Calories Burned,Physical Activity Level,Dietary Habits,Sleep Disorders,Medication Usage
0,1,25,f,8,23:00,06:30,8000,2500,medium,healthy,no,no
1,2,34,m,7,00:30,07:00,5000,2200,low,unhealthy,yes,yes
2,3,29,f,9,22:45,06:45,9000,2700,high,healthy,no,no
3,4,41,m,5,01:00,06:30,4000,2100,low,unhealthy,yes,no
4,5,22,f,8,23:30,07:00,10000,2800,high,medium,no,no
...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,43,m,7,00:45,07:15,6500,2400,medium,medium,no,no
96,97,33,f,8,23:15,06:15,8500,2600,high,medium,no,no
97,98,46,m,4,01:30,07:00,3000,2000,low,unhealthy,yes,yes
98,99,25,f,9,22:15,06:45,9500,2700,high,healthy,no,no


## Exercises

### 1. Load the CSV file and display the first 10 rows of the dataset. Then, print the summary statistics of the Daily Steps and Calories Burned columns

In [4]:
import pandas as pd
df = pd.read_csv('Health_Sleep_Statistics.csv')
df.head(10)
df[['Daily Steps', 'Calories Burned']].describe().round(2)

Unnamed: 0,Daily Steps,Calories Burned
count,100.0,100.0
mean,6830.0,2421.0
std,2498.71,281.07
min,3000.0,2000.0
25%,4750.0,2175.0
50%,6750.0,2400.0
75%,9000.0,2700.0
max,11000.0,2900.0


In [5]:
f = pd.unique(df['Gender'])
print(f,type(f))

['f' 'm'] <class 'numpy.ndarray'>


### 2. Filter the dataset to find users who have a sleep quality score below 3 and burn more than 2500 calories per day. Display their User ID, Age, and Sleep Quality

In [6]:
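# Note: the task asks for Sleep Quality below 3, i.e. df['Sleep Quality'] < 3.
# The cell as run uses > 3, which is what the output below reflects.
filtered_df = df[(df['Sleep Quality'] > 3) & (df['Calories Burned'] > 2500)]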
filtered_df[['User ID','Age','Sleep Quality']]

Unnamed: 0,User ID,Age,Sleep Quality
2,3,29,9
4,5,22,8
6,7,30,8
8,9,27,9
11,12,23,9
14,15,28,9
18,19,33,9
20,21,29,8
22,23,40,9
24,25,32,8
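For comparison, a filter that follows the task statement (sleep quality below 3 and more than 2500 calories burned) could look like the sketch below. The rows in `df_demo` are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
df_demo = pd.DataFrame({
    'User ID': [1, 2, 3, 4],
    'Age': [25, 34, 29, 41],
    'Sleep Quality': [8, 2, 9, 1],
    'Calories Burned': [2500, 2600, 2700, 2100],
})

# Each comparison needs its own parentheses because & binds tighter than < and >.
mask = (df_demo['Sleep Quality'] < 3) & (df_demo['Calories Burned'] > 2500)

# Select the matching rows and the three requested columns in one step.
result = df_demo.loc[mask, ['User ID', 'Age', 'Sleep Quality']]
print(result)
```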


### 3. Create a new column called Sleep Duration that calculates the duration of sleep in hours using the Bedtime and Wake-up Time columns. Assume these columns are in 24-hour format (e.g., "22:30", "06:45")

In [16]:
# Parse the 'Bedtime' and 'Wake-up Time' columns, which use the 24-hour "HH:MM" format
df['Bedtime'] = pd.to_datetime(df['Bedtime'], format='%H:%M')
df['Wake-up Time'] = pd.to_datetime(df['Wake-up Time'], format='%H:%M')

# Compute the sleep duration in hours (seconds converted to hours)
df['Sleep Duration'] = (df['Wake-up Time'] - df['Bedtime']).dt.total_seconds() / 3600

# A bedtime before midnight gives a negative difference, so add 24 hours
df.loc[df['Sleep Duration'] < 0, 'Sleep Duration'] += 24

# Show the first few rows as a check
print(df.head())


   User ID  Age Gender  Sleep Quality             Bedtime        Wake-up Time  \
0        1   25      f              8 1900-01-01 23:00:00 1900-01-01 06:30:00   
1        2   34      m              7 1900-01-01 00:30:00 1900-01-01 07:00:00   
2        3   29      f              9 1900-01-01 22:45:00 1900-01-01 06:45:00   
3        4   41      m              5 1900-01-01 01:00:00 1900-01-01 06:30:00   
4        5   22      f              8 1900-01-01 23:30:00 1900-01-01 07:00:00   

   Daily Steps  Calories Burned Physical Activity Level Dietary Habits  \
0         8000             2500                  medium        healthy   
1         5000             2200                     low      unhealthy   
2         9000             2700                    high        healthy   
3         4000             2100                     low      unhealthy   
4        10000             2800                    high         medium   

  Sleep Disorders Medication Usage  Sleep Duration  
0              
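Because both columns are parsed onto the same placeholder date (1900-01-01), a bedtime before midnight produces a negative difference. The sketch below shows one way to wrap such values with the modulo operator; the three rows reuse the bedtime and wake-up values of the first three dataset rows, and the frame name `times` is introduced only for illustration.

```python
import pandas as pd

# Times in the same "HH:MM" format as the dataset.
times = pd.DataFrame({
    'Bedtime': ['23:00', '00:30', '22:45'],
    'Wake-up Time': ['06:30', '07:00', '06:45'],
})

bed = pd.to_datetime(times['Bedtime'], format='%H:%M')
wake = pd.to_datetime(times['Wake-up Time'], format='%H:%M')

# Raw difference in hours; negative when bedtime is before midnight.
hours = (wake - bed).dt.total_seconds() / 3600

# Wrap negative values into the next day (for example, -16.5 becomes 7.5).
times['Sleep Duration'] = hours % 24
print(times)
```

The modulo form treats identical bedtime and wake-up times as 0 hours of sleep rather than 24, which may or may not be the desired convention.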

### 4. Find the user with the maximum physical activity level and display their User ID, Daily Steps, Calories Burned, and Physical Activity Level.

In [25]:
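# Find the row with the maximum Physical Activity Level.
# The column holds text labels, so idxmax compares them alphabetically
# ('medium' > 'low' > 'high'), which is why a 'medium' user is returned.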
max_activity_user = df.loc[df['Physical Activity Level'].idxmax()]

# Display User ID, Daily Steps, Calories Burned, and Physical Activity Level
print(max_activity_user[['User ID', 'Daily Steps', 'Calories Burned', 'Physical Activity Level']])

User ID                         1
Daily Steps                  8000
Calories Burned              2500
Physical Activity Level    medium
Name: 0, dtype: object
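To rank the labels by activity rather than alphabetically, one option is to map them to numbers first. The sketch below assumes the order low < medium < high; the `rank` mapping and the rows in `df_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the low/medium/high labels match those in the dataset.
df_demo = pd.DataFrame({
    'User ID': [1, 2, 3],
    'Daily Steps': [8000, 5000, 9000],
    'Calories Burned': [2500, 2200, 2700],
    'Physical Activity Level': ['medium', 'low', 'high'],
})

# Assumed ranking of the labels.
rank = {'low': 0, 'medium': 1, 'high': 2}
level_rank = df_demo['Physical Activity Level'].map(rank)

# idxmax on the numeric ranks returns the first row with the highest level.
top_user = df_demo.loc[level_rank.idxmax()]
print(top_user[['User ID', 'Daily Steps', 'Calories Burned', 'Physical Activity Level']])
```

When several users share the highest level, `idxmax` returns only the first of them; filtering on `level_rank == level_rank.max()` would return all of them.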


### 5. Group the users by Gender and calculate the average Daily Steps and Calories Burned for each gender

In [18]:
# Group by Gender and calculate the mean of Daily Steps and Calories Burned
gender_group = df.groupby('Gender')[['Daily Steps', 'Calories Burned']].mean()

print(gender_group)

        Daily Steps  Calories Burned
Gender                              
f            8940.0           2654.0
m            4720.0           2188.0


### 6. Check if there are any missing values in the dataset and fill them with appropriate values. For numeric columns, fill with the column mean, and for categorical columns, fill with the mode

In [26]:
# Check for missing values
print(df.isnull().sum())

# Fill numeric columns with mean and categorical columns with mode
for column in df.columns:
    if df[column].dtype == 'object':
        df[column].fillna(df[column].mode()[0], inplace=True)
    else:
        df[column].fillna(df[column].mean(), inplace=True)

# Check if missing values are filled
print(df.isnull().sum())



User ID                    0
Age                        0
Gender                     0
Sleep Quality              0
Bedtime                    0
Wake-up Time               0
Daily Steps                0
Calories Burned            0
Physical Activity Level    0
Dietary Habits             0
Sleep Disorders            0
Medication Usage           0
dtype: int64
User ID                    0
Age                        0
Gender                     0
Sleep Quality              0
Bedtime                    0
Wake-up Time               0
Daily Steps                0
Calories Burned            0
Physical Activity Level    0
Dietary Habits             0
Sleep Disorders            0
Medication Usage           0
dtype: int64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[column].fillna(df[column].mean(), inplace=True)
  df[column].fillna(df[column].mode()[0], inplace=True)
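The warning above suggests assigning the result back instead of calling `fillna(..., inplace=True)` on a selected column. A minimal sketch of that pattern follows; the frame `df_demo` and its values are hypothetical, and `pd.api.types.is_numeric_dtype` is used here as one way to tell numeric columns from categorical ones.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in a numeric column and one in a text column.
df_demo = pd.DataFrame({
    'Daily Steps': [8000.0, np.nan, 9000.0],
    'Gender': ['f', None, 'f'],
})

for column in df_demo.columns:
    if pd.api.types.is_numeric_dtype(df_demo[column]):
        fill_value = df_demo[column].mean()      # numeric: column mean
    else:
        fill_value = df_demo[column].mode()[0]   # categorical: most frequent value
    # Assign the result back instead of using fillna(..., inplace=True)
    # on df_demo[column].
    df_demo[column] = df_demo[column].fillna(fill_value)

print(df_demo.isnull().sum())
```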


### 7. Add a column called Age Group that categorizes users into Young (0-30), Middle-aged (31-60), and Senior (61+) based on their age

In [21]:
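# Note: this cell repeats the missing-value handling from question 6;
# it does not create the Age Group column.
# Check for missing values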
print(df.isnull().sum())

# Fill numeric columns with mean and categorical columns with mode
for column in df.columns:
    if df[column].dtype == 'object':
        df[column].fillna(df[column].mode()[0], inplace=True)
    else:
        df[column].fillna(df[column].mean(), inplace=True)

# Check if missing values are filled
print(df.isnull().sum())

User ID                    0
Age                        0
Gender                     0
Sleep Quality              0
Bedtime                    0
Wake-up Time               0
Daily Steps                0
Calories Burned            0
Physical Activity Level    0
Dietary Habits             0
Sleep Disorders            0
Medication Usage           0
Sleep Duration             0
dtype: int64
User ID                    0
Age                        0
Gender                     0
Sleep Quality              0
Bedtime                    0
Wake-up Time               0
Daily Steps                0
Calories Burned            0
Physical Activity Level    0
Dietary Habits             0
Sleep Disorders            0
Medication Usage           0
Sleep Duration             0
dtype: int64


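One way to build the Age Group column described in this exercise is `pd.cut`, sketched below. The ages in `df_demo` are hypothetical, and the bin edges are an assumption derived from the ranges in the task: intervals are right-closed, so 30 counts as Young and 60 as Middle-aged.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit every bin, including the edges.
df_demo = pd.DataFrame({'Age': [22, 30, 31, 60, 61, 75]})

# Assumed bin edges: 0-30, 31-60, above 60.
# include_lowest=True makes the first interval include age 0.
df_demo['Age Group'] = pd.cut(
    df_demo['Age'],
    bins=[0, 30, 60, np.inf],
    labels=['Young', 'Middle-aged', 'Senior'],
    include_lowest=True,
)
print(df_demo)
```

The resulting column has a categorical dtype whose categories keep the order Young, Middle-aged, Senior.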


### 8. Identify the users who report taking medication and have a Sleep Disorder. Display their User ID, Age, and Medication Usage

In [13]:
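# Filter users with Medication Usage and Sleep Disorders.
# The dataset stores these values in lowercase ('yes'/'no'), so comparing
# against 'Yes' matches no rows; use 'yes' to get the intended users.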
med_sleep_disorder_users = df[(df['Medication Usage'] == 'Yes') & (df['Sleep Disorders'] == 'Yes')]

# Display User ID, Age, and Medication Usage
print(med_sleep_disorder_users[['User ID', 'Age', 'Medication Usage']])

Empty DataFrame
Columns: [User ID, Age, Medication Usage]
Index: []
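The empty result comes from a case mismatch: the rows shown earlier store `yes` and `no` in lowercase. A sketch of a comparison that is robust to case and stray whitespace follows; the rows in `df_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the lowercase 'yes'/'no' values seen in the dataset.
df_demo = pd.DataFrame({
    'User ID': [1, 2, 3],
    'Age': [25, 34, 41],
    'Sleep Disorders': ['no', 'yes', 'yes'],
    'Medication Usage': ['no', 'yes', 'no'],
})

# Normalize whitespace and case before comparing, so 'Yes' and 'yes' both match.
takes_medication = df_demo['Medication Usage'].str.strip().str.lower() == 'yes'
has_disorder = df_demo['Sleep Disorders'].str.strip().str.lower() == 'yes'

matches = df_demo.loc[takes_medication & has_disorder,
                      ['User ID', 'Age', 'Medication Usage']]
print(matches)
```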


### 9. Calculate the total Calories Burned by each Age Group (from the column you created in question 7)

In [14]:
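# Group by Age Group and calculate total Calories Burned.
# This requires the Age Group column from question 7; it was never created
# in this notebook, hence the KeyError.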
calories_by_age_group = df.groupby('Age Group')['Calories Burned'].sum()

print(calories_by_age_group)

KeyError: 'Age Group'
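The `KeyError` means the grouping column does not exist yet. The sketch below creates it first and then sums the calories per group; the rows in `df_demo` are hypothetical and the bin edges are an assumption taken from the ranges in question 7.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
df_demo = pd.DataFrame({
    'Age': [25, 34, 29, 41, 65],
    'Calories Burned': [2500, 2200, 2700, 2100, 2000],
})

# The grouping column has to exist before groupby can use it.
df_demo['Age Group'] = pd.cut(
    df_demo['Age'],
    bins=[0, 30, 60, np.inf],
    labels=['Young', 'Middle-aged', 'Senior'],
    include_lowest=True,
)

# observed=False keeps every age group in the result, even an empty one.
calories_by_age_group = (
    df_demo.groupby('Age Group', observed=False)['Calories Burned'].sum()
)
print(calories_by_age_group)
```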

### 10. Sort the dataset by Physical Activity Level in descending order and display the top 5 users by Physical Activity Level, along with their User ID, Daily Steps, and Calories Burned

In [15]:
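# Sort the dataset by Physical Activity Level in descending order.
# The labels are text, so the order is alphabetical ('medium', 'low', 'high'),
# not high-to-low activity.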
sorted_df = df.sort_values(by='Physical Activity Level', ascending=False)

# Display the top 5 users by Physical Activity Level
print(sorted_df[['User ID', 'Daily Steps', 'Calories Burned']].head(5))

    User ID  Daily Steps  Calories Burned
0         1         8000             2500
63       64         6500             2400
34       35        10000             2750
35       36         6000             2300
39       40         5000             2200
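To sort by activity rather than alphabetically, the column can be converted to an ordered categorical first. The sketch below assumes the order low < medium < high; the rows in `df_demo` are hypothetical, and breaking ties by Daily Steps is an added choice that the task does not specify.

```python
import pandas as pd

# Hypothetical rows; the low/medium/high labels match those in the dataset.
df_demo = pd.DataFrame({
    'User ID': [1, 2, 3, 4],
    'Daily Steps': [8000, 5000, 9000, 4000],
    'Calories Burned': [2500, 2200, 2700, 2100],
    'Physical Activity Level': ['medium', 'low', 'high', 'low'],
})

# An ordered categorical makes sorting follow low < medium < high (assumed order).
df_demo['Physical Activity Level'] = pd.Categorical(
    df_demo['Physical Activity Level'],
    categories=['low', 'medium', 'high'],
    ordered=True,
)

# Sort by level, then by Daily Steps within a level, both descending.
sorted_demo = df_demo.sort_values(
    by=['Physical Activity Level', 'Daily Steps'], ascending=False
)
print(sorted_demo[['User ID', 'Daily Steps', 'Calories Burned']].head(5))
```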
