Course: CS/CE 457/464-L1 - Data Science
Student: Syed Asghar Abbas Zaidi (07201)
Email: saazaidi2001@gmail.com
This repository collects the homework assignments, projects, and reference material from a Data Science course. The course runs from fundamentals to advanced topics in data analysis, machine learning, statistical modeling, and data engineering. Each assignment applies these techniques to real-world datasets using widely used Python and SQL tooling.
- DS_HW1_sz07201: Data Wrangling & Cleaning
- DS_HW2_sz07201: Exploratory Data Analysis (EDA)
- DS_HW3_sz07201: Statistical Inference & Hypothesis Testing
- DS_HW4_sz07201: SQL Database Management
- DS_HW5_sz07201: NoSQL Database Concepts
- DS_HW6_sz07201: Regression Analysis
- DS_HW7_sz07201: Classification & Decision Trees
- DS_HW8_sz07201: Clustering & Unsupervised Learning
- DS_HW9_sz07201: Time Series Analysis
- DS_HW10_sz07201: Natural Language Processing (NLP)
- DS_HW11_sz07201: Deep Learning & Neural Networks
- DS_HW12_sz07201: Big Data Processing with Apache Spark
- DS_Midterm/: Midterm examination materials and solutions
- Lecture-Slides/: Course presentation materials
- Theory/: Theoretical foundations and reference materials
- Other works/: Additional projects and practice exercises
- Data cleaning techniques and best practices
- Handling missing values and outliers
- Data transformation and normalization
- Feature engineering and selection
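The cleaning steps listed above can be sketched in a few lines. This is a minimal, library-free illustration and not code from the assignments, which use pandas. The function name `clean_column`, the median-imputation choice, the 1.5 × IQR fences, and the sample values are all assumptions made for the example.

```python
from statistics import median, quantiles

def clean_column(values):
    """Impute missing entries with the median, clip IQR outliers, min-max scale."""
    present = [v for v in values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in values]

    # Tukey fences: 1.5 * IQR beyond the first and third quartiles.
    q1, _, q3 = quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    clipped = [min(max(v, low), high) for v in filled]

    # Min-max normalization to [0, 1]; a constant column maps to 0.0.
    lo, hi = min(clipped), max(clipped)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in clipped]

ages = [22, 25, None, 24, 23, 95, 26]  # made-up sample with one gap and one outlier
scaled = clean_column(ages)
```

Clipping keeps the outlying row in the data set while limiting its influence. Dropping the row instead is an equally valid choice when the value is clearly a recording error.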
- Statistical summaries and distributions
- Data visualization using matplotlib, seaborn
- Correlation analysis and pattern identification
- Univariate and multivariate analysis
- Descriptive and inferential statistics
- Hypothesis testing and confidence intervals
- Probability distributions and sampling
- Statistical significance testing
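As an illustration of significance testing, the sketch below runs a two-sided permutation test for a difference in means using only the standard library. It is an assumed example and not necessarily the method used in the homework. The function name `permutation_test`, the resample count, and both sample groups are hypothetical.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]  # made-up measurements
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]
p_value = permutation_test(group_a, group_b)
```

A small `p_value` means a difference at least as large as the observed one rarely arises when group labels are assigned at random.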
- SQL querying and database design
- NoSQL concepts (MongoDB, JSON)
- Data extraction and transformation
- Database optimization techniques
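A short sketch of the SQL topics above, using Python's built-in `sqlite3` module with an in-memory database. The `players` table, its columns, the index name, and the rows are invented for illustration and do not come from the course datasets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players ("
    "id INTEGER PRIMARY KEY, name TEXT, club TEXT, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO players (name, club, rating) VALUES (?, ?, ?)",
    [
        ("Player A", "Club X", 88),
        ("Player B", "Club X", 79),
        ("Player C", "Club Y", 91),
        ("Player D", "Club Y", 85),
    ],
)

# An index on the grouping column is one common optimization for larger tables.
conn.execute("CREATE INDEX idx_players_club ON players (club)")

# Parameterized query: count and average rating per club for players rated 80+.
rows = conn.execute(
    "SELECT club, COUNT(*), AVG(rating) FROM players "
    "WHERE rating >= ? GROUP BY club ORDER BY club",
    (80,),
).fetchall()
conn.close()
```

Passing `80` as a bound parameter instead of formatting it into the query string avoids SQL injection and lets SQLite reuse the prepared statement.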
- Supervised Learning: Regression and classification algorithms
- Unsupervised Learning: Clustering and dimensionality reduction
- Model evaluation and validation
- Cross-validation and performance metrics
- Feature importance analysis
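The idea behind k-fold cross-validation can be sketched without any library. The assignments use scikit-learn; the version below is a simplified stand-in that scores a mean-value baseline by mean absolute error. The helper names `k_fold_indices` and `cross_val_mae` and the price values are hypothetical.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_val_mae(y, k=5):
    """Score a mean-value baseline by mean absolute error on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        # "Fit" on the training folds only, then evaluate on the held-out fold.
        prediction = mean(y[j] for j in train_idx)
        scores.append(mean(abs(y[j] - prediction) for j in test_idx))
    return scores

prices = [210, 340, 275, 190, 420, 305, 260, 380, 230, 295]  # made-up values
fold_scores = cross_val_mae(prices, k=5)
```

Averaging `fold_scores` gives a single performance estimate, and the spread across folds indicates how sensitive that estimate is to the particular split.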
- Natural Language Processing: Sentiment analysis, Named Entity Recognition
- Time Series Analysis: Forecasting and trend analysis
- Deep Learning: Neural networks and computer vision
- Recommendation Systems: Content-based and collaborative filtering
- Apache Spark fundamentals
- Distributed computing concepts
- Data pipeline development
- Python (Primary language for all assignments)
- SQL (Database querying and management)
- Data Manipulation: pandas, numpy
- Visualization: matplotlib, seaborn, plotly
- Machine Learning: scikit-learn, statsmodels
- Deep Learning: TensorFlow, Keras
- Natural Language Processing: NLTK, spaCy, TextBlob
- Database: sqlite3, pymongo
- Big Data: PySpark
- Jupyter Notebooks
- Google Colab
- VS Code
- FIFA Players Data: Player statistics and performance analysis
- House Pricing Data: Real estate price prediction
- Weather Data: Time series analysis of meteorological data
- Employee Attrition: HR analytics and workforce prediction
- Anime Dataset: Content recommendation systems
- Burger King Menu: Nutritional analysis and clustering
- Airbnb Listings: Accommodation data analysis
- Admission Chance Data: Educational outcome prediction
- Iris Dataset: Classic classification problem
- Synthetic Business Data: Practice with various analytical scenarios
- Data Analysis: Statistical analysis, hypothesis testing, correlation studies
- Machine Learning: Supervised and unsupervised learning implementations
- Data Visualization: Creating meaningful charts, plots, and dashboards
- Database Management: SQL queries, database design, NoSQL concepts
- Text Analytics: Sentiment analysis, keyword extraction, topic modeling
- Deep Learning: Image classification, neural network development
- Big Data: Distributed computing and Spark applications
- Cross-validation and model validation techniques
- Feature engineering and selection strategies
- Model interpretation and explainability
- A/B testing and experimental design
- Data pipeline development and automation
This repository demonstrates comprehensive understanding of:
- Data Science Lifecycle: From data collection to model deployment
- Statistical Thinking: Proper application of statistical methods
- Machine Learning: Implementation and evaluation of various algorithms
- Data Engineering: Database design and big data processing
- Domain Knowledge: Application of data science to various industries
- Programming Proficiency: Efficient Python programming and library usage
- Communication: Clear documentation and visualization of findings
- Python 3.7+
- Jupyter Notebook or JupyterLab
- Required packages: pandas, numpy, matplotlib, seaborn, scikit-learn, and the other libraries listed in requirements.txt
- Clone the repository:
    git clone https://github.com/AsgharAZ/Data-Science.git
- Navigate to the desired homework directory:
    cd DS_HW2_sz07201
- Install required dependencies:
    pip install -r requirements.txt
- Open Jupyter Notebook:
    jupyter notebook
- Each homework directory contains a main Jupyter notebook with complete analysis
- Data files are included in respective directories
- Work through each notebook from top to bottom, in the order the cells appear
This repository represents coursework completed for:
- Course Code: CS/CE 457/464-L1
- Institution: Habib University
- Academic Year: 2024
- Focus: Applied Data Science and Machine Learning
- From basic data manipulation to advanced machine learning
- Real-world datasets and practical applications
- Multiple programming paradigms and tools
- Best practices in data science methodology
- Proper model evaluation and validation
- Clean, documented, and reproducible code
- Each homework builds upon previous concepts
- Increasing complexity and sophistication
- Integration of multiple data science domains
- All assignments follow academic integrity guidelines
- Code is well-commented for educational purposes
- Some assignments explore more than one approach to the same problem to demonstrate learning
- Real-world applications are emphasized throughout
This is an academic portfolio repository. For educational purposes, learners are encouraged to:
- Study the methodologies and approaches used
- Understand the rationale behind different techniques
- Practice similar exercises with different datasets
- Extend the analyses with additional techniques
Last Updated: October 30, 2024
Repository Status: Academic Coursework Portfolio
License: Educational Use