# Data Analysis with Python - IBM Course Notes

## Course Overview
This notebook contains notes from the IBM Data Analysis with Python course on Coursera. It covers core data analysis concepts and their practical implementation with Python libraries such as NumPy, Pandas, Matplotlib, and Scikit-learn.

**Course Link:** [Data Analysis with Python on Coursera](https://www.coursera.org/learn/data-analysis-with-python/)

## Course Structure

| Module | Topic | Duration |
|--------|--------|----------|
| 1 | Importing Data Sets | 1 hour |
| 2 | Data Wrangling | 1 hour |
| 3 | Exploratory Data Analysis | 2 hours |
| 4 | Model Development | 2 hours |
| 5 | Model Evaluation and Refinement | 2 hours |
| 6 | Final Assignment | 4 hours |

## Module 1: Importing Data Sets

### Python Libraries for Data Analysis

#### Scientific Computing Libraries
- **NumPy**: Fundamental package for scientific computing with Python
  - Arrays and matrices
  - Mathematical functions
  - Random number capabilities

- **Pandas**: Data manipulation and analysis library
  - DataFrame and Series objects
  - Data importing and exporting
  - Data alignment and integration

- **SciPy**: Scientific computing library built on NumPy
  - Scientific and technical computing
  - Optimization and linear algebra
  - Signal and image processing

#### Visualization Libraries
- **Matplotlib**: Basic plotting library
  - Create static, animated, and interactive visualizations
  - Extensive customization options

- **Seaborn**: Statistical data visualization
  - Built on top of Matplotlib
  - Higher-level interface for statistical graphics

- **Plotly**: Interactive visualization library
  - Web-based plotting
  - Interactive features
  - Support for various chart types

#### Machine Learning Libraries
- **Scikit-learn**: Machine learning library
  - Classification, regression, clustering
  - Model selection and preprocessing

- **StatsModels**: Statistical modeling
  - Estimation of statistical models
  - Statistical tests
  - Statistical data exploration

- **TensorFlow**: Deep learning framework
  - Neural networks
  - Deep learning models
  - GPU acceleration

In [None]:
# Import essential libraries
import numpy as np  # for numerical operations
import pandas as pd  # for data manipulation
import matplotlib.pyplot as plt  # for visualization
import seaborn as sns  # for statistical visualizations

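# Set style for better visualizations
# (the 'seaborn' Matplotlib style name was deprecated in Matplotlib 3.6,
# so apply seaborn's default theme directly instead)
sns.set_theme()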
sns.set_palette('husl')

### Data Importing Techniques

#### Supported File Formats
- CSV (.csv)
- JSON (.json)
- Excel (.xlsx, .xls)
- Text files (.txt)
- HDF5 (.h5, .hdf5)
- SQL databases
- Web APIs
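
As a minimal sketch of how the formats above map to pandas readers, the example below loads the same small table from CSV text and from JSON text. The sample data and the names `csv_text`, `json_text`, `df_from_csv`, and `df_from_json` are assumptions made for illustration; they are not from the course material.

```python
import io
import pandas as pd

# Hypothetical sample data, embedded as strings so the example needs no files.
csv_text = "make,price\naudi,13950\nbmw,16430\n"
json_text = '[{"make": "audi", "price": 13950}, {"make": "bmw", "price": 16430}]'

# Each format has its own reader; both accept a path, a URL, or a file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_csv)
print(df_from_json)
```

Excel and HDF5 files follow the same pattern with `pd.read_excel` and `pd.read_hdf`, although those readers depend on optional packages being installed.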

#### File Path Types
1. Local files: `/path/to/file/data.csv`
2. URLs: `https://example.com/data.csv`
3. Database connections
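
The first two path types can be sketched as follows. The local file is written to a temporary directory first so that the example is self-contained. The URL line is left commented out because `https://example.com/data.csv` is only a placeholder address. The `sample` DataFrame is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row dataset used only for this illustration.
sample = pd.DataFrame({"id": [1, 2], "value": [10.5, 20.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # 1. Local file: build the path with os.path.join so it works on any OS.
    local_path = os.path.join(tmp_dir, "data.csv")
    sample.to_csv(local_path, index=False)
    df_local = pd.read_csv(local_path)

# 2. URL: read_csv accepts an http(s) URL in place of a path.
#    Commented out because it needs network access and a real address.
# df_remote = pd.read_csv("https://example.com/data.csv")

print(df_local)
```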

#### Common Import Patterns

In [None]:
# CSV Import Examples

# Basic CSV import
df_basic = pd.read_csv('data.csv')

# CSV with specific options
df_advanced = pd.read_csv('data.csv',
                         header=0,  # use first row as headers
                         index_col=0,  # use first column as index
                         parse_dates=['date_column'],  # parse date columns
                         na_values=['NA', '?']  # custom NA values
                        )

# CSV without headers
df_no_header = pd.read_csv('data.csv',
                          header=None,
                          names=['col1', 'col2', 'col3']  # custom column names
                         )
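
To see what these options do without needing a `data.csv` file, here is a small runnable sketch that reads CSV text from memory. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; '?' marks a missing price.
raw_csv = (
    "id,date_column,price\n"
    "a1,2024-01-05,100\n"
    "a2,2024-01-06,?\n"
    "a3,2024-01-07,300\n"
)

df_demo = pd.read_csv(
    io.StringIO(raw_csv),
    index_col=0,                  # 'id' becomes the index
    parse_dates=["date_column"],  # converted to a datetime column
    na_values=["NA", "?"],        # '?' is read as NaN
)

print(df_demo.dtypes)
print(df_demo["price"].isna().sum())  # number of missing prices
```

Because one price is missing, pandas stores the `price` column as floating point rather than integer.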

### Data Exploration Methods

#### Basic DataFrame Information

In [None]:
# Assuming we have a DataFrame 'df'

# A notebook cell only displays its last expression,
# so each result is printed explicitly.

# View first few rows
print("First 5 rows:")
print(df.head())

# View last few rows
print("\nLast 5 rows:")
print(df.tail())

# Get DataFrame info (info() prints its own output)
print("\nDataFrame information:")
df.info()

# Get statistical summary
print("\nStatistical summary:")
print(df.describe())

# Get statistical summary for all columns including non-numeric
print("\nComplete statistical summary:")
print(df.describe(include='all'))
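
The difference between the two `describe` calls is easier to see on a concrete table. The following sketch uses a hypothetical four-row DataFrame, `df_demo`, with one text column and one numeric column.

```python
import pandas as pd

# Hypothetical data with one categorical and one numeric column.
df_demo = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "price": [13950, 16430, 17450, 12940],
})

# Default summary covers numeric columns only:
# count, mean, std, min, quartiles, max.
numeric_summary = df_demo.describe()

# include='all' adds unique/top/freq for non-numeric columns;
# statistics that do not apply to a column appear as NaN.
full_summary = df_demo.describe(include="all")

print(df_demo.shape)   # (rows, columns)
print(df_demo.dtypes)
print(full_summary)
```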

### Handling Missing Data

#### Strategies for Missing Data

1. **Investigation**
   - Check with data source
   - Understand why data is missing
   - Determine if missing pattern is random

2. **Removal Options**
   - Drop entire variables (columns)
   - Drop specific observations (rows)
   - Consider impact on sample size

3. **Replacement Options**
   - Mean/median imputation
   - Mode imputation for categorical data
   - Predictive imputation
   - Forward/backward fill

4. **Keep as Missing**
   - Use algorithms that handle missing values
   - Create missing value indicators
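
A minimal sketch of the replacement and indicator strategies above, assuming a hypothetical table with a numeric `horsepower` column and a categorical `fuel_type` column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
df_demo = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "fuel_type": ["gas", "gas", None, "diesel"],
})

# Keep a record of which values were missing before filling them.
df_demo["horsepower_missing"] = df_demo["horsepower"].isna()

# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
hp_median = df_demo["horsepower"].median()
df_demo["horsepower"] = df_demo["horsepower"].fillna(hp_median)

# Categorical column: fill with the mode (most frequent value).
fuel_mode = df_demo["fuel_type"].mode()[0]
df_demo["fuel_type"] = df_demo["fuel_type"].fillna(fuel_mode)

print(df_demo)
```

Creating the indicator column before imputing preserves the information that a value was originally absent, which a later model can use as a feature.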

In [None]:
# Missing Data Handling Examples

# Check missing values (printed explicitly, since a cell only displays its last expression)
print("Missing values per column:")
print(df.isnull().sum())

# Drop rows with any missing values
df_cleaned = df.dropna()

# Drop rows with missing values in specific columns
df_cleaned_subset = df.dropna(subset=['important_column'])

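# Fill missing values in numeric columns with the column mean
# (numeric_only=True avoids a TypeError on text columns in pandas 2.0+)
df_filled = df.fillna(df.mean(numeric_only=True))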

# Fill missing values with different strategies per column
df_filled_custom = df.fillna({
    'numeric_column': df['numeric_column'].mean(),
    'categorical_column': 'unknown'
})
# Forward fill is a method rather than a fill value, so apply it separately
df_filled_custom['datetime_column'] = df_filled_custom['datetime_column'].ffill()
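
Forward and backward fill depend on row order, so they are usually applied to sorted, time-indexed data. The sketch below uses a hypothetical series of daily readings to show how the two differ.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps.
readings = pd.Series(
    [np.nan, 20.0, np.nan, np.nan, 23.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill carries the last known value forward; a leading gap stays NaN.
forward = readings.ffill()

# Chaining a backward fill pulls the next known value back to cover the leading gap.
both = readings.ffill().bfill()

print(forward.tolist())
print(both.tolist())
```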

### Database Connectivity

#### SQL Database Connection Example

In [None]:
# Example using SQLAlchemy (recommended approach)
from sqlalchemy import create_engine

def connect_to_database(database_url):
    """
    Create a database connection using SQLAlchemy.
    
    Parameters:
    database_url (str): Database connection URL
    
    Returns:
    engine: SQLAlchemy engine object
    """
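    try:
        engine = create_engine(database_url)
        # create_engine is lazy, so open a connection to surface a bad URL here
        with engine.connect():
            pass
        return engine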
    except Exception as e:
        print(f"Error connecting to database: {e}")
        return None

def query_database(engine, query):
    """
    Execute SQL query and return results as DataFrame.
    
    Parameters:
    engine: SQLAlchemy engine object
    query (str): SQL query to execute
    
    Returns:
    DataFrame: Query results
    """
    try:
        return pd.read_sql(query, engine)
    except Exception as e:
        print(f"Error executing query: {e}")
        return None
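
For a query example that runs without a database server, the sketch below swaps the SQLAlchemy engine for an in-memory SQLite connection from Python's standard `sqlite3` module, which pandas also accepts. The `cars` table and its contents are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database from the standard library; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table used only for this illustration.
cars = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950, 16430, 12940],
})
cars.to_sql("cars", conn, index=False)

# Parameterized query: the driver substitutes the value for '?',
# which is safer than building the SQL string by hand.
expensive = pd.read_sql(
    "SELECT make, price FROM cars WHERE price > ? ORDER BY price",
    conn,
    params=(13000,),
)
conn.close()

print(expensive)
```

With other databases, the same `pd.read_sql` call would take an engine such as the one returned by `connect_to_database`, and the placeholder syntax depends on the database driver.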

### Data Export Methods

In [None]:
# Export examples

# Export to CSV
df.to_csv('exported_data.csv', index=False)

# Export to Excel (writing .xlsx requires an engine such as openpyxl)
df.to_excel('exported_data.xlsx', sheet_name='Sheet1')

# Export to JSON
df.to_json('exported_data.json')

# Export to SQL database
# (assumes `engine` is a SQLAlchemy engine, e.g. from connect_to_database)
df.to_sql('table_name', engine, if_exists='replace', index=False)
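
One way to check an export is to write it to a string and read it back. The sketch below assumes a hypothetical two-row DataFrame and relies on `to_csv` and `to_json` returning text when no path is given.

```python
import io
import pandas as pd

# Hypothetical DataFrame used only for this illustration.
df_demo = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path argument, to_csv and to_json return the output as a string.
csv_text = df_demo.to_csv(index=False)
json_text = df_demo.to_json(orient="records")

# Reading the CSV text back should reproduce the original values.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
print(round_trip)
```

Passing `index=False` keeps the row index out of the CSV, so the file read back has the same columns as the original.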

## Additional Resources

### Hands-on Labs
1. [Importing Data Sets Lab 1](/labs/1_DA0101EN-Review-Introduction.ipynb)
2. [Importing Data Sets Lab 2](/labs/2_Practice_data_loading.ipynb)
3. [Data Wrangling Lab](/labs/3_DA0101EN-Review-Data-Wrangling.ipynb)

### Further Reading
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Real Python - Working with CSV Files](https://realpython.com/python-csv/)