
AsgharAZ/Data-Science


Data Science Course Repository

Course: CS/CE 457/464-L1 - Data Science
Student: Syed Asghar Abbas Zaidi (07201)
Email: saazaidi2001@gmail.com

📋 Project Overview

This repository collects the homework assignments, projects, and resources from a Data Science course. The course moves from fundamental to advanced concepts in data analysis, machine learning, statistical modeling, and data engineering. Each assignment applies data science techniques to real-world datasets using standard Python tools.

πŸ—‚οΈ Repository Structure

Homework Assignments

  • DS_HW1_sz07201: Data Wrangling & Cleaning
  • DS_HW2_sz07201: Exploratory Data Analysis (EDA)
  • DS_HW3_sz07201: Statistical Inference & Hypothesis Testing
  • DS_HW4_sz07201: SQL Database Management
  • DS_HW5_sz07201: NoSQL Database Concepts
  • DS_HW6_sz07201: Regression Analysis
  • DS_HW7_sz07201: Classification & Decision Trees
  • DS_HW8_sz07201: Clustering & Unsupervised Learning
  • DS_HW9_sz07201: Time Series Analysis
  • DS_HW10_sz07201: Natural Language Processing (NLP)
  • DS_HW11_sz07201: Deep Learning & Neural Networks
  • DS_HW12_sz07201: Big Data Processing with Apache Spark

Additional Resources

  • DS_Midterm/: Midterm examination materials and solutions
  • Lecture-Slides/: Course presentation materials
  • Theory/: Theoretical foundations and reference materials
  • Other works/: Additional projects and practice exercises

🧠 Key Topics Covered

1. Data Preprocessing & Wrangling

  • Data cleaning techniques and best practices
  • Handling missing values and outliers
  • Data transformation and normalization
  • Feature engineering and selection
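
As a rough illustration of the missing-value handling and normalization listed above, the sketch below fills gaps with the median and then rescales to [0, 1]. It is not taken from the homework notebooks (which use pandas); it uses only the standard library so it runs without extra packages, and the `ages` column and both helper names are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column with two missing entries.
ages = [20, None, 30, 40, None]
filled = impute_median(ages)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

Median imputation is only one option; whether it is appropriate depends on why the values are missing.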

2. Exploratory Data Analysis (EDA)

  • Statistical summaries and distributions
  • Data visualization using matplotlib, seaborn
  • Correlation analysis and pattern identification
  • Univariate and multivariate analysis

3. Statistical Methods

  • Descriptive and inferential statistics
  • Hypothesis testing and confidence intervals
  • Probability distributions and sampling
  • Statistical significance testing
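
To make the confidence-interval item above concrete, here is a minimal sketch of an interval for a sample mean. It assumes a normal approximation (for small samples a t distribution would usually be the better choice), and the sample values and the function name are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the sample mean."""
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error


# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```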

4. Database Management

  • SQL querying and database design
  • NoSQL concepts (MongoDB, JSON)
  • Data extraction and transformation
  • Database optimization techniques

5. Machine Learning

  • Supervised Learning: Regression and classification algorithms
  • Unsupervised Learning: Clustering and dimensionality reduction
  • Model evaluation and validation
  • Cross-validation and performance metrics
  • Feature importance analysis
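
For the cross-validation item above, the following sketch shows the idea of k-fold splitting with a deliberately trivial model (predict the training mean). It is written in plain Python for illustration and is not the course's implementation; in practice scikit-learn provides this, and the data would usually be shuffled first. The function names and the `targets` list are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_mse(y, k):
    """Mean test MSE of a baseline that always predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    return sum(scores) / len(scores)


# Hypothetical target values.
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = list(k_fold_indices(len(targets), 3))
score = cross_val_mse(targets, 3)
print(folds[0])
print(score)
```

Each sample lands in exactly one test fold, so every observation is used for evaluation once and for training k - 1 times.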

6. Advanced Analytics

  • Natural Language Processing: Sentiment analysis, Named Entity Recognition
  • Time Series Analysis: Forecasting and trend analysis
  • Deep Learning: Neural networks and computer vision
  • Recommendation Systems: Content-based and collaborative filtering
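
As one small example of the forecasting topic above, a moving-average forecast predicts the next value as the mean of the most recent observations. This is a baseline sketch only, not the method used in the time series homework; the `temps` series and the function name are hypothetical.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


# Hypothetical daily temperatures.
temps = [20.0, 22.0, 21.0, 23.0, 24.0]
forecast = moving_average_forecast(temps, window=3)
print(round(forecast, 2))
```

A larger window smooths out noise but reacts more slowly to a genuine trend.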

7. Big Data Technologies

  • Apache Spark fundamentals
  • Distributed computing concepts
  • Data pipeline development

πŸ› οΈ Technologies & Tools

Programming Languages

  • Python (Primary language for all assignments)
  • SQL (Database querying and management)

Key Libraries

  • Data Manipulation: pandas, numpy
  • Visualization: matplotlib, seaborn, plotly
  • Machine Learning: scikit-learn, statsmodels
  • Deep Learning: TensorFlow, Keras
  • Natural Language Processing: NLTK, spaCy, TextBlob
  • Database: sqlite3, pymongo
  • Big Data: PySpark

Development Environment

  • Jupyter Notebooks
  • Google Colab
  • VS Code

📊 Datasets Used

Real-world Datasets

  • FIFA Players Data: Player statistics and performance analysis
  • House Pricing Data: Real estate price prediction
  • Weather Data: Time series analysis of meteorological data
  • Employee Attrition: HR analytics and workforce prediction
  • Anime Dataset: Content recommendation systems
  • Burger King Menu: Nutritional analysis and clustering
  • Airbnb Listings: Accommodation data analysis
  • Admission Chance Data: Educational outcome prediction

Benchmark and Synthetic Datasets

  • Iris Dataset: Classic classification problem
  • Synthetic Business Data: Practice with various analytical scenarios

📈 Key Skills Demonstrated

Technical Skills

  • Data Analysis: Statistical analysis, hypothesis testing, correlation studies
  • Machine Learning: Supervised and unsupervised learning implementations
  • Data Visualization: Creating meaningful charts, plots, and dashboards
  • Database Management: SQL queries, database design, NoSQL concepts
  • Text Analytics: Sentiment analysis, keyword extraction, topic modeling
  • Deep Learning: Image classification, neural network development
  • Big Data: Distributed computing and Spark applications

Methodologies

  • Cross-validation and model validation techniques
  • Feature engineering and selection strategies
  • Model interpretation and explainability
  • A/B testing and experimental design
  • Data pipeline development and automation

🎯 Learning Outcomes

This repository demonstrates comprehensive understanding of:

  1. Data Science Lifecycle: From data collection to model deployment
  2. Statistical Thinking: Proper application of statistical methods
  3. Machine Learning: Implementation and evaluation of various algorithms
  4. Data Engineering: Database design and big data processing
  5. Domain Knowledge: Application of data science to various industries
  6. Programming Proficiency: Efficient Python programming and library usage
  7. Communication: Clear documentation and visualization of findings

🚀 How to Use This Repository

Prerequisites

  • Python 3.7+
  • Jupyter Notebook or JupyterLab
  • Required packages: pandas, numpy, matplotlib, seaborn, scikit-learn, etc.
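
One way to check these prerequisites before opening a notebook is a short script like the sketch below. It is not part of the repository. The package list mirrors the names given above (scikit-learn is imported as `sklearn`), and the helper names are assumptions.

```python
import sys
from importlib.util import find_spec

# Import names for the packages listed above; scikit-learn imports as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]


def python_ok(minimum=(3, 7)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Python version OK:", python_ok())
missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```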

Installation

  1. Clone the repository:
     git clone https://github.com/AsgharAZ/Data-Science.git
  2. Navigate to the desired homework directory inside the cloned repository:
     cd Data-Science/DS_HW2_sz07201
  3. Install required dependencies (if a requirements.txt file is present):
     pip install -r requirements.txt
  4. Open Jupyter Notebook:
     jupyter notebook

Running the Notebooks

  • Each homework directory contains a main Jupyter notebook with complete analysis
  • Data files are included in respective directories
  • Run the cells in order within each notebook, since later cells depend on earlier ones

📚 Academic Context

This repository represents coursework completed for:

  • Course Code: CS/CE 457/464-L1
  • Institution: Habib University
  • Academic Year: 2024
  • Focus: Applied Data Science and Machine Learning

🔍 Key Features

Comprehensive Coverage

  • From basic data manipulation to advanced machine learning
  • Real-world datasets and practical applications
  • Multiple programming paradigms and tools

Industry Standards

  • Best practices in data science methodology
  • Proper model evaluation and validation
  • Clean, documented, and reproducible code

Progressive Learning

  • Each homework builds upon previous concepts
  • Increasing complexity and sophistication
  • Integration of multiple data science domains

📝 Notes

  • All assignments follow academic integrity guidelines
  • Code is well-commented for educational purposes
  • Multiple approaches are sometimes explored to demonstrate learning
  • Real-world applications are emphasized throughout

🤝 Contributing

This is an academic portfolio repository. For educational purposes, learners are encouraged to:

  • Study the methodologies and approaches used
  • Understand the rationale behind different techniques
  • Practice similar exercises with different datasets
  • Extend the analyses with additional techniques

Last Updated: October 30, 2024
Repository Status: Academic Coursework Portfolio
License: Educational Use

About

This repository includes projects and resources from my Data Science course in Python. It covers data analysis techniques such as cleaning, EDA, statistical modeling, machine learning, and visualization. I use libraries such as Pandas, NumPy, and Matplotlib, along with MongoDB, to analyze datasets and build predictive models.
