Data Science

Programming Language and Software	Software Links
Data Science in Python	Python
Data Science in R	R
Data Science in Excel	Excel
Data Science in Power BI	Power BI
Data Science in Tableau	Tableau

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, both quantitative and qualitative. The resulting knowledge and actionable insights can be applied across a wide range of domains.

Practicing data science means building programming skills together with proficiency in mathematics and statistics, and applying them to derive meaningful insights from structured and unstructured data, such as Kaggle datasets and real-world data. Learning proceeds step by step through analytical techniques, statistics, and research methods.

The most commonly used methods in data science include regression, clustering, visualization, decision trees/rules, and random forests. The data analysis process can be learned with tools such as Python, R, Excel, Power BI, and Tableau. Aspiring data scientists should also expand their knowledge of machine learning and deep learning to build a comprehensive understanding of data and its analysis.

Completed Staff Work (CSW)

Completed Staff Work, similar to data analysis, empowers decision makers to identify solutions to problems or address issues through the careful consideration of reasonable and workable alternatives.

7 Steps to CSW

1. Identify, describe, or define the problem.

2. Gather or compile information about the problem.

3. Organize information for review & consideration.

4. Analyze or evaluate the information.

5. Develop, compile or generate alternatives.

6. Select or identify the solution you want to recommend based on the results of your objective analysis.

7. Develop a plan to implement the solution and the documents necessary to authorize the implementation.

Prerequisites

Python 3.5+

R 3.5.3+

Excel 2016+

Power BI

Tableau

🔷 Getting Started with Data Science 🔷

🔵 Step-by-Step to Data Science

Define Problem
Data Collection
Data Understanding
Data Analysis/Cleaning
Data Organization/Transformation
Data Validation/Anomaly Detection
Feature Engineering
Model Training
Model Evaluation/Validation
Model Deployment
Model Monitoring
Data Drift/Model Drift
Reports
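
A few of the steps above (cleaning, model training, and model evaluation) can be sketched in plain Python. This is only an illustrative sketch: the records are made up, the variable names are hypothetical, and a real project would normally use libraries such as pandas and scikit-learn instead of hand-written least squares.

```python
# Hypothetical raw records: (house size in m^2, price in thousands).
# None marks a missing value. The data is invented for illustration.
raw = [(50, 150), (60, 180), (None, 200), (80, 240),
       (100, 300), (120, None), (70, 210), (90, 270)]

# Data cleaning: drop rows that contain a missing value.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Hold out the last two clean rows for evaluation.
train, test = clean[:-2], clean[-2:]


def fit_line(rows):
    """Model training: ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Model evaluation: mean absolute error on the held-out rows.
mae = sum(abs(slope * x + intercept - y) for x, y in test) / len(test)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={mae:.2f}")
```

Evaluating on rows that were not used for training is what makes the error estimate meaningful for new data.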

🔵 Three Types of Positions in Data Science

Data Engineer
- Develops, constructs, tests, and maintains architectures such as databases and large-scale processing systems.
Data Analyst
- Interprets data and turns it into information that can offer ways to improve a business.
- Gathers information from various sources and interprets patterns and trends.
Machine Learning Scientist
- Researches and develops algorithms.
- Makes predictions from data with labels and features.
- Creates predictive models.

🔵 Types of Data Analysis: Techniques and Methods

Descriptive Analysis
Text Analysis
Statistical Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis

🔵 Two Types of Data

Supervised Data (Data is labeled with a category or a number)
- Classification (Predict a category)
- Regression (Predict a number)
Unsupervised Data (Data is not labeled in any way)
- Clustering (Divide by similarity)
- Dimensionality Reduction (Generalization) - Find hidden dependencies
- Association (Identify Sequences)
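
The difference between the two cases can be sketched with a toy example. The measurements, labels, and function names below are hypothetical. The supervised part uses a 1-nearest-neighbour rule, and the unsupervised part uses a simple "split at the widest gap" heuristic as a stand-in for a real clustering algorithm such as K-Means.

```python
# Hypothetical 1-D measurements (for example, petal length in cm).
values = [1.0, 1.2, 1.1, 4.8, 5.0, 5.2]

# Supervised: every training value comes with a known label.
labels = ["small", "small", "small", "large", "large", "large"]


def classify(x):
    """Classification: copy the label of the closest training value."""
    nearest = min(range(len(values)), key=lambda i: abs(values[i] - x))
    return labels[nearest]


def two_clusters(data):
    """Clustering without labels: split the sorted values at the widest gap."""
    ordered = sorted(data)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


print(classify(4.5))         # uses the labels
print(two_clusters(values))  # never looks at the labels
```

The classifier can name its prediction because it was given labels. The clustering step can only say which values belong together.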

🔵 Learning about Exploratory Data Analysis

Import, read, clean, and validate
- Define Variables
  1. Y is "Dependent Variable" and goes on y-axis (the left side, vertical one) - output value
  2. X is "Independent Variable" and goes on the x-axis (the bottom, horizontal one) - input value
- Type of Data
  1. Quantitative
  - Ratio or Interval
    - Discrete and Continuous
      Discrete variables can only take certain numerical values and are counted
      Continuous variables can take any numerical value within a range and are measured
  2. Qualitative
  - Nominal or Ordinal
    - Binary, nominal data, and ordinal data
      Categorical variables take category or label values and place an individual into one of several groups.
- Type of data measurements
  1. Nominal - names or labels with no inherent order
    For example, gender: male and female. Other examples include eye colour and hair colour.
  2. Ordinal - ordered categories, often used for non-numeric concepts like satisfaction, happiness, discomfort, etc. The order is known, but the exact differences between values are not.
    For example: rating happiness on a scale of 1-10.
  3. Interval - numeric scales in which we know both the order and the exact differences between the values
    For example: temperature in degrees Celsius or Fahrenheit. The difference in temperature between 10 and 20 degrees is the same as the difference between 20 and 30 degrees.
    A Likert scale is a rating scale, often found on survey forms, that measures how people feel about something. It is composed of a series of four or more Likert-type items that represent similar questions, ideally each with 5-7 balanced responses and often a neutral midpoint, combined into a single composite score/variable. Likert scale data is often analyzed as interval data, in which case means and standard deviations are used to describe the scale.
  4. Ratio - numeric scales with the properties of interval data plus a true zero
    For example: height, whether measured in centimetres, metres, inches, or feet. A true zero means the complete absence of the quantity, so negative values are not possible.
Visualize distributions
- Univariate visualization
- Bivariate visualization
- Multivariate visualization
- Dimensionality reduction
Explore relations between variables
- Descriptive statistics
- Inferential statistics
- Statistical graphics
Explore multivariate relationships
Statistical Analysis
- Cases, Variables, Types of Variables
- Matrix and Frequency Table
- Graphs and Shapes of Distributions
- Mode, Median and Mean
- Range, Interquartile Range and Box Plot
- Variance and Standard deviation
- Z-scores
- Contingency Table, Scatterplot, Pearson's r
- Basics of Regression
- Elementary Probability
- Random Variables and Probability Distributions
- Normal Distribution, Binomial Distribution & Poisson Distribution
- Hypothesis Testing
  3 Steps:
  (1) Making an initial assumption.
  (2) Collecting evidence (data).
  (3) Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.
Inferential Statistics
- Observational Studies and Experiments
- Sample and Population
- Population Distribution, Sample Distribution and Sampling Distribution
- Central Limit Theorem
- Point Estimates
- Confidence Intervals
- Introduction to Hypothesis Testing
Questions about data
- Do you have the right data for exploratory data analysis?
- Do you need other data?
- Do you have the right question?
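
The three hypothesis-testing steps listed under Statistical Analysis can be illustrated with an exact two-sided binomial test. This is a sketch under stated assumptions: the coin-flip experiment (16 heads in 20 flips), the 0.05 significance level, and the function name are all hypothetical choices made for the example.

```python
from math import comb

# Step 1: initial assumption (null hypothesis): the coin is fair.
# Step 2: evidence from a hypothetical experiment: 16 heads in 20 flips.
n_flips, n_heads = 20, 16


def two_sided_p_value(heads, flips):
    """Exact two-sided binomial test against a fair coin.

    Counts every outcome at least as far from flips / 2 as the observed one.
    All 2**flips head/tail sequences are equally likely under the null.
    """
    distance = abs(heads - flips / 2)
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2 ** flips


# Step 3: decide whether to reject the initial assumption.
alpha = 0.05  # assumed significance level
p_value = two_sided_p_value(n_heads, n_flips)
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject fair-coin assumption: {reject_null}")
```

A small p-value means the observed data would be unusual if the initial assumption were true, which is the basis for rejecting it. Failing to reject does not prove the assumption.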

🔵 Learning to be a Data Scientist

Choose Programming Language
- Python or R
Mathematics and Linear Algebra
Big Data
Data Visualization
Data Cleaning
How to Solve Problems
Machine Learning
- Types of learning algorithms
1. Supervised Learning
- Dataset has labels
- Classification
  - Binary Classification
  - Multiclass Classification
  - Multilabel Classification
- Regression
  - Linear Regression: Linear relationships between inputs and outputs
  - Logistic Regression: Probability of a binary output (used for classification, despite its name)
2. Unsupervised Learning
- Dataset is unlabeled
3. Semi-supervised Learning
- Dataset contains both labeled and unlabeled data
4. Reinforcement Learning
- Learns from mistakes
- An agent takes "actions" in an environment and observes the "state" of the environment through its features
- Executing different actions in each state brings different "rewards"
- It learns a "policy": which action to take in each state
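
The reinforcement learning loop described above (actions, states, rewards, policy) can be sketched with tabular Q-learning on a tiny corridor. Everything here is an assumption made for illustration: the five-cell environment, the reward of 1 at the right end, the hyperparameters, and the function names. Q-learning is one common algorithm for this setting, not the only one.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right
GOAL = N_STATES - 1


def step(state, action):
    """Environment: return (next_state, reward). Only reaching the goal pays."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap on steps per episode
            # Epsilon-greedy: explore sometimes (and on ties), else exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if state == GOAL:
                break
    return q


q_table = train()
# The learned policy: the preferred action in each non-goal state.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:GOAL]]
print(policy)
```

The agent is never told that moving right is correct. It infers this from the rewards it collects, and the policy is read off the learned action values.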

Common Machine Learning Algorithms

Linear Regression
Logistic Regression
Decision Tree
SVM
Naive Bayes
kNN
K-Means
Random Forest
Dimensionality Reduction Algorithms
Gradient Boosting algorithms

Deep Learning
- Common Libraries
1. TensorFlow
2. Keras
3. Theano
4. PyTorch
5. scikit-learn (sklearn)
6. Caffe
7. Apache Spark
8. Chainer

🔵 Underfitting and Overfitting

Overfitting

Overfitting - the gap between training and test error is large.
Overfitting - the training error is much smaller than the test error.
Overfitting - the larger the hypothesis space, the higher the tendency for the model to overfit the training dataset.
A model suffering from overfitting will have high variance and low bias.

Fixing Overfitting

Simplify the model (fewer parameters)
Simplify training data (fewer attributes)
Constrain the model (regularization)
Use cross-validation
Use early stopping
Build an ensemble
Gather more data

Underfitting

Underfitting - both the training and test errors are high.
A model suffering from underfitting will have high bias and low variance.

Fixing Underfitting

Increase model complexity (more parameters)
Increase number of features
Feature engineering should help
Un-constrain the model (reduce or remove regularization)
Reduce or remove noise in the data
Train for longer
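
The symptoms described in this section can be reproduced on a toy regression problem by comparing three models of increasing flexibility. This is a sketch with made-up data: the underlying relation y = 2x, the hand-picked noise of +/-3 (fixed rather than random so the run is reproducible), and all names are assumptions for illustration.

```python
# Hypothetical data: y = 2x plus fixed, hand-picked noise of +/-3.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (3 if i % 2 == 0 else -3) for i, x in enumerate(train_x)]
test_x = [i + 0.25 for i in range(10)]
test_noise = [3, 3, -3, -3, 3, 3, -3, -3, 3, 3]
test_y = [2 * x + n for x, n in zip(test_x, test_noise)]


def mean_model(x):
    """Too simple: ignores x and always predicts the training mean."""
    return sum(train_y) / len(train_y)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx, mean_y - (sxy / sxx) * mean_x


slope, intercept = fit_line(train_x, train_y)


def line_model(x):
    """Matches the true structure of the data: a straight line."""
    return slope * x + intercept


def memorizer(x):
    """Too flexible: 1-nearest-neighbour, reproduces the training noise."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


models = {"mean": mean_model, "line": line_model, "memorizer": memorizer}
errors = {name: (mse(m, train_x, train_y), mse(m, test_x, test_y))
          for name, m in models.items()}
for name, (train_err, test_err) in errors.items():
    print(f"{name:10s} train MSE={train_err:6.2f}  test MSE={test_err:6.2f}")
```

In this run the mean model has high error on both sets (underfitting), the memorizer has zero training error but a much larger test error (overfitting), and the line has similar, lower errors on both.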

🔵 Learning to improve the Model or Prediction

Improve the "Accuracy" of Machine Learning Model

Add More Data
Add More Features
Feature Engineering
Feature Selection
Use Regularization
Multiple Algorithms
Ensemble Methods
Cross Validation
Algorithm Tuning
Bagging or Boosting
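
Cross validation, listed above, can be sketched as a k-fold index splitter. The function name and the sample sizes are hypothetical, and the sketch deliberately skips shuffling. In practice the rows are usually shuffled first, and a library routine such as scikit-learn's `KFold` is the more common choice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Every sample lands in exactly one validation fold. The first
    n_samples % k folds get one extra sample when the split is uneven.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(10, 3))
for training, validation in folds:
    print("train on", training, "validate on", validation)
```

A model is trained once per fold and scored on that fold's validation indices. Averaging the k scores gives a steadier estimate than a single train/test split.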

LastAncientOne/Data-Science

Author:

Tin Hang
