Here we only prepare the data for data scientists; we do not analyze it  
Get the right data for the goal you want to achieve  
Data from different sources will differ in:
- Quality  
- Completeness  
- Coverage  
- Formatting  
- Structure  

Example: Predictive analysis of patient readmission within 30 days of discharge  
Relevant data:  
- Electronic health records  
  - Patient health histories  
  - Outcomes  
  - Diagnoses  
  - Treatment plans  
- Hospital admission, discharge and transfer systems  
  - Dates and reasons for admissions, discharges, and transfers within the hospital  
- Wearable device data  
  - Patient-generated vital signs  
- Insurance claim data  
  - Procedures performed

# Data Collection
To collect data that is relevant, accurate and robust enough for analysis, consider:  
- Objective – what should the data collection achieve?  
- Data requirements – what specific data is needed to meet the objective?  
- Data sources – where can the data be obtained?  
- Data quality – ensure the data collected is of high quality  
- Data volume – assess how much data is needed and the capacity to handle it  
- Ethics – follow ethical practices and comply with legal standards and permissions  
- Data format – which format should be used?  
- Collection methods and tools – choose appropriate methods and the right tools  
- Timeframe and budget – align with project deadlines and budget constraints  
- Data privacy and security – protect the privacy of individuals and store and handle the data securely  
- Documentation – keep detailed documentation of the data collection process  
- Pilot test – validate the data collection process and instruments  
- Review and adaptation – regularly review the process and adjust where necessary  

# Data Pre-Processing
Selection  
- Subsetting: Choose only the data relevant to the analysis by filtering rows and columns on the required criteria  
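
Subsetting can be sketched in pandas as a row filter plus a column selection. The table, its column names and the age threshold below are made up for illustration; they are assumptions, not part of the hospital example.

```python
import pandas as pd

# Made-up admissions records, for illustration only
admissions = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [34, 71, 58, 80],
    "ward": ["cardiology", "oncology", "cardiology", "geriatrics"],
    "length_of_stay": [2, 9, 4, 12],
})

# Row criterion: patients aged 50 or over
# Column criterion: only the fields this (hypothetical) analysis needs
subset = admissions.loc[admissions["age"] >= 50, ["patient_id", "age", "length_of_stay"]]
print(subset)
```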

Sampling  
- Stratified sampling: Sample each subgroup proportionally so the sample keeps the same proportions as the population  
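
A minimal sketch of stratified sampling, assuming a made-up population with a single stratification column called `age_group`. It uses pandas' grouped `sample` to draw the same fraction from every stratum, which is one way to keep the proportions, not the only one.

```python
import pandas as pd

# Made-up population: 50% / 30% / 20% across three age groups
population = pd.DataFrame({
    "patient_id": range(100),
    "age_group": ["18-39"] * 50 + ["40-64"] * 30 + ["65+"] * 20,
})

# Draw 20% from each stratum so the sample mirrors the population mix
sample = population.groupby("age_group").sample(frac=0.2, random_state=0)

# Compare the proportions before and after sampling
print(population["age_group"].value_counts(normalize=True).sort_index())
print(sample["age_group"].value_counts(normalize=True).sort_index())
```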

Formatting  
- Date formats, numeric/text formats, currencies, metric/imperial units, text encoding, time zones, etc.  
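
Two of these formatting steps, sketched with pandas. The column names, the assumption that timestamps were recorded in UTC, and the target time zone are all illustrative choices.

```python
import pandas as pd

# Made-up raw export: timestamps as text, heights in inches
raw = pd.DataFrame({
    "discharged_at": ["2024-03-01 14:30", "2024-03-02 09:15"],
    "height_in": [70.0, 64.0],
})

# Parse the text into timestamps, declare them as UTC (an assumption),
# then convert to the time zone used for reporting
raw["discharged_at"] = (
    pd.to_datetime(raw["discharged_at"], format="%Y-%m-%d %H:%M")
    .dt.tz_localize("UTC")
    .dt.tz_convert("Asia/Tokyo")
)

# Imperial to metric: 1 inch = 2.54 cm
raw["height_cm"] = raw["height_in"] * 2.54
print(raw)
```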

Data cleaning  
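- Anomalies and apparent errors are not always mistakes; they can reveal insights or new patterns. E.g. the "are you still watching?" prompt exists because an unusually long viewing session may mean the viewer is away from the keyboard  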
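
One way to act on this is to flag suspicious values for review instead of deleting them. A minimal sketch, assuming made-up heart-rate readings and the common 1.5 × IQR rule of thumb; neither comes from the notes above.

```python
import pandas as pd

# Made-up heart-rate readings with one unusual value
heart_rate = pd.Series([72, 75, 71, 74, 190, 73, 70])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = heart_rate.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (heart_rate < q1 - 1.5 * iqr) | (heart_rate > q3 + 1.5 * iqr)

# Keep the flagged rows for inspection: the value may be an error or a real event
print(heart_rate[is_outlier])
```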

Data transformation  
- Normalization, standardization, aggregation, discretisation, binning/bucketing  
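
A short sketch of three of these transformations on a made-up `age` series: min-max normalization, z-score standardization, and binning with `pd.cut`. The bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40], name="age")

# Min-max normalization: rescale to the range [0, 1]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization (z-score): mean 0, standard deviation 1
standardized = (ages - ages.mean()) / ages.std(ddof=0)

# Binning: intervals are (0, 29], (29, 39], (39, 120]
binned = pd.cut(ages, bins=[0, 29, 39, 120], labels=["<30", "30-39", "40+"])

print(pd.DataFrame({"age": ages, "normalized": normalized,
                    "standardized": standardized, "bin": binned}))
```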

Text data  
- Tokenisation, stemming/lemmatisation, vectorisation  
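
Tokenisation and a simple count-based vectorisation can be sketched with the standard library alone. The two sentences are made up, and the regular-expression tokeniser is a deliberate simplification; stemming and lemmatisation are left out because they normally rely on a dedicated NLP library.

```python
import re
from collections import Counter

# Made-up documents, for illustration only
docs = ["The patient was discharged.", "Patient readmitted; the patient recovered."]

def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokens = [tokenise(doc) for doc in docs]

# Vocabulary: every distinct token, in a fixed (sorted) order
vocab = sorted({token for doc in tokens for token in doc})

def vectorise(doc_tokens):
    """Bag-of-words vector: one count per vocabulary term."""
    counts = Counter(doc_tokens)
    return [counts[term] for term in vocab]

vectors = [vectorise(doc) for doc in tokens]
print(vocab)
print(vectors)
```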

Image data  
- Resize, crop, normalize, colour space conversion, rotate, flip, translate, noise injection, brightness/contrast adjustment  
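
Several of these image operations reduce to array manipulation. A minimal sketch with NumPy, assuming a tiny made-up grayscale image stored as an 8-bit array; real pipelines would typically use an imaging library for resizing and colour space conversion.

```python
import numpy as np

# Made-up 4x6 grayscale "image" with 8-bit pixel values
image = np.arange(24, dtype=np.uint8).reshape(4, 6)

flipped = image[:, ::-1]                      # horizontal flip
rotated = np.rot90(image)                     # rotate 90 degrees counter-clockwise
cropped = image[1:3, 1:5]                     # crop rows 1-2 and columns 1-4
scaled = image.astype(np.float32) / 255.0     # normalize pixel values to [0, 1]

print(flipped.shape, rotated.shape, cropped.shape, scaled.max())
```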


# Data Enrichment
Enhancing the data with additional context or merging it with other data sources to make it more complete and more informative. This step can involve adding demographic information, geographical tags, or combining datasets to provide more depth.  

- Data integration  
  - Combining data from different sources to create a more comprehensive dataset.  
- Data augmentation  
  - Adding information to existing records to enhance the depth of data on each data point.  
- Attribute enrichment  
  - Enhancing existing datasets by adding new attributes or features that were not previously included.  
- Temporal enrichment  
  - Adding time-related data to datasets.  
- Semantic enrichment  
  - Adding metadata or other semantic information to make data more understandable and usable.  
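
Data integration and temporal enrichment can be sketched with a pandas merge and the `.dt` accessor. Both tables and all column names below are made up for illustration.

```python
import pandas as pd

# Made-up source 1: admissions
admissions = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "admitted_on": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-13"]),
})
# Made-up source 2: demographics (patient 3 is missing)
demographics = pd.DataFrame({
    "patient_id": [1, 2],
    "age_group": ["65+", "40-64"],
})

# Data integration: a left join keeps every admission, matched or not
enriched = admissions.merge(demographics, on="patient_id", how="left")

# Temporal enrichment: derive weekday information from the admission date
enriched["admitted_weekday"] = enriched["admitted_on"].dt.day_name()
enriched["weekend_admission"] = enriched["admitted_on"].dt.dayofweek >= 5
print(enriched)
```

The unmatched patient ends up with a missing `age_group`, which is worth checking for after any join.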

Data wrangling is the process of making data useful  



# Data Validation
Ensuring that the data is accurate, consistent, and usable for analysis.  
Validation checks are performed to verify data quality and correctness after cleaning and transformation.  
- Define Validation Rules and Criteria  
  - Establish the standards that your data must meet.  
- Check for accuracy  
  - Ensure that the data accurately reflects the real-world entities or values it represents.  
- Ensure consistency  
  - Verify that the data is consistent within the dataset and across related datasets.  
- Validate data completeness  
  - Ensure no critical data is missing, which could impact analysis.  
- Test for logical integrity  
  - Confirm that the data makes logical sense and adheres to known constraints or relationships.  
- Validate range and constraints   
  - Ensure that data values fall within acceptable ranges or constraints.  
- Format validation  
  - Verify that the data is in the expected format or structure, which is essential for automated processing.  
- Uniqueness checks  
  - Ensure that entries that are supposed to be unique do not have duplicates.  
- Cross-validation  
  - Use related datasets or data sources to validate the data.  
- Automate validation processes  
  - Streamline validation to make it efficient and repeatable, especially for large datasets or ongoing data collection.  
- Document validation issues and resolutions  
  - Keep track of identified issues and how they were resolved for future reference and accountability.  
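
Several of these checks (completeness, range, uniqueness, logical integrity) can be automated as boolean masks. A minimal sketch on made-up records; the column names and the 0 to 120 age range are assumed validation rules, not rules from the notes.

```python
import pandas as pd

# Made-up records containing one of each kind of problem
records = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "age": [34, 130, 58, None],
    "admitted_on": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-10"]),
    "discharged_on": pd.to_datetime(["2024-01-08", "2024-01-04", "2024-01-09", "2024-01-12"]),
})

# One boolean column per rule; True marks a row that violates the rule
report = pd.DataFrame({
    "missing_age": records["age"].isna(),
    "age_out_of_range": ~records["age"].between(0, 120) & records["age"].notna(),
    "duplicate_patient_id": records["patient_id"].duplicated(keep=False),
    "discharge_before_admission": records["discharged_on"] < records["admitted_on"],
})

# Number of violations per rule, suitable for logging or documentation
print(report.sum())
```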

# Data Storing
The final step involves saving the wrangled data in a suitable storage system for easy access and analysis.  
This could be databases, data warehouses, or cloud storage solutions, depending on the scale and purpose of the data analysis project.  
- Selection of Storage Solution  
  - Choose an appropriate data storage solution that aligns with the data's nature, size, and intended use.  
- Data Modelling and Schema Design  
  - Design a logical structure for the data that supports efficient storage, query, and retrieval.  
- Normalization and Denormalization  
  - Organize the data to reduce redundancy and improve integrity in relational databases through
    normalization, or optimize performance and read efficiency through denormalization, especially in
    NoSQL databases or data warehouses.
- Data Formatting and Encoding  
  - Ensure that data is stored in a consistent and appropriate format that matches the storage system’s requirements.  
- Implementing Data Security Measures  
  - Protect data from unauthorized access and ensure privacy and compliance with data protection regulations (e.g., GDPR, HIPAA).  
- Data Indexing and Optimization  
  - Enhance the speed and efficiency of data retrieval.  
- Backup and Recovery Planning  
  - Ensure data durability and the ability to recover from data loss or corruption.  
- Data Lifecycle Management  
  - Manage the lifecycle of data from creation to deletion, aligning with data retention policies and legal requirements.  
- Documentation and Metadata Management  
  - Provide context and understanding for the data and how it’s stored, facilitating easier data management, governance, and use.  
- Monitoring and Maintenance  
  - Ensure the storage system remains efficient, scalable, and reliable as data volume grows and access patterns change.  
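
As a small-scale illustration of storing, indexing and retrieving wrangled data, the sketch below writes a made-up table to an in-memory SQLite database. SQLite, the table name and the index name are assumptions chosen to keep the example self-contained; a real project would pick the storage system that fits its scale.

```python
import sqlite3
import pandas as pd

# Made-up wrangled output
clean = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "readmitted_30d": [0, 1, 0],
})

conn = sqlite3.connect(":memory:")

# Store the table, then index the column used for lookups
clean.to_sql("readmissions", conn, index=False, if_exists="replace")
conn.execute("CREATE INDEX idx_readmissions_patient ON readmissions (patient_id)")

# Retrieve only the rows needed for analysis
loaded = pd.read_sql_query(
    "SELECT * FROM readmissions WHERE readmitted_30d = 1", conn
)
conn.close()
print(loaded)
```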

- Any relation between ice cream consumption and heat stroke?
- Any relation between ice cream consumption and illness?
- Any relation between summer weather and happiness?


Which data would you need for the hospital example?  

If you need a sampled dataset, how do you select the sample, and which potential issues need to be considered?  
How to select the data samples:  
Select samples that match the population distribution as closely as possible on the most important attributes, e.g. age and income group.  

Are there any potential issues to be considered?  
How are you going to match the sample that closely across all relevant attributes?  
Small attribute groups yield very small samples, which are susceptible to outliers.  
-> Matching on country is one attribute; what about age groups and income groups? How are you going to stratify on all of them at once?  
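
The stratification problem can be made concrete with a little arithmetic. The numbers of levels per attribute and the sample size below are hypothetical, picked only to show how quickly the strata multiply.

```python
from math import prod

# Hypothetical number of levels for each stratification attribute
levels = {"country": 10, "age_group": 5, "income_group": 4}
sample_size = 1000  # hypothetical

# Every combination of levels is its own stratum
n_strata = prod(levels.values())
avg_rows_per_stratum = sample_size / n_strata

print(n_strata, avg_rows_per_stratum)
```

With only a handful of rows per stratum on average, a single outlier can dominate a group.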

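Does normalization change the data's characteristics?  
Generally no: the ordering of the values and the shape of their distribution are preserved.  
-> But what if the data ranges from 100.0 to 100.5? The values are practically the same, yet min-max normalization stretches them across the whole 0 to 1 range.  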
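
The 100.0 to 100.5 case can be checked directly. A minimal sketch with made-up readings, assuming min-max normalization is the method in question:

```python
import pandas as pd

# Made-up readings that barely differ from one another
readings = pd.Series([100.0, 100.1, 100.25, 100.5])

# Min-max normalization to [0, 1]
scaled = (readings - readings.min()) / (readings.max() - readings.min())

# Spread relative to the mean, before scaling
relative_spread = (readings.max() - readings.min()) / readings.mean()

print(scaled.tolist())
print(relative_spread)
```

The ordering is unchanged, but a spread of about half a percent of the mean now covers the full range, so the scaled values look far more different than the raw ones.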

# Tutorial
A pandas DataFrame gets a default row index of 0, 1, 2, ... but you can specify the row index and column names yourself  
```python
import pandas as pd
pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], index=[2, 3, 4], columns=['a', 'b', 'c', 'd'])  # placeholder values
```


```python
import pandas as pd

# Sample DataFrame; the scalar "Writing" is broadcast to every row
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'LA', 'Chicago', 'Houston'],
    'Hobbies': "Writing"
})

# Derive a new column from an existing one
df['Age after 5 years'] = df['Age'] + 5
print(df)

# iloc selects by integer position: iloc[0] is the first row, returned as a Series
# (df[0] would instead look for a column labelled 0)
print(df.iloc[0])
print(df.iloc[0].values)  # the same row as a NumPy array
```

Output:
```
      Name  Age     City  Hobbies  Age after 5 years
0    Alice   25       NY  Writing                 30
1      Bob   30       LA  Writing                 35
2  Charlie   35  Chicago  Writing                 40
3    David   40  Houston  Writing                 45
Name                   Alice
Age                       25
City                      NY
Hobbies              Writing
Age after 5 years         30
Name: 0, dtype: object
['Alice' np.int64(25) 'NY' 'Writing' np.int64(30)]
```


```python
# loc selects by label; a label slice includes both endpoints ('Age' through 'Hobbies')
df.loc[:, 'Age':'Hobbies']
```

Output:
```
   Age     City  Hobbies
0   25       NY  Writing
1   30       LA  Writing
2   35  Chicago  Writing
3   40  Houston  Writing
```
